GWAS Analysis



By Rameez Syed

Supervisor:
Chelsie E. Benca-Bachman PhD
Rohan H.C. Palmer PhD

Table of Contents

Getting Started

What is Jupyter Notebook?

"Notebook" = Documents that contain both code and text elements (e.g., figures, links, equations).

Because of the mix of code and text elements, they are:

  • Ideal place to bring together an analysis description, and its results
  • Can be executed to perform the data analysis in real time.

Running cells

  1. Enter print("Hello World!") into the first cell
  2. Select the cell
  3. Select “Run” or shift enter
In [1]:
print("Hello World")
[1] "Hello World"
In [ ]:
version
In [2]:
# suppress_warning
# suppress_messages
library(tidyverse)
── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──

 ggplot2 3.3.5      purrr   0.3.4
 tibble  3.1.6      dplyr   1.0.8
 tidyr   1.2.0      stringr 1.4.0
 readr   2.1.2      forcats 0.5.1

── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()

In [3]:
head(mpg)
A tibble: 6 × 11
manufacturermodeldisplyearcyltransdrvctyhwyflclass
<chr><chr><dbl><int><int><chr><chr><int><int><chr><chr>
audia41.819994auto(l5) f1829pcompact
audia41.819994manual(m5)f2129pcompact
audia42.020084manual(m6)f2031pcompact
audia42.020084auto(av) f2130pcompact
audia42.819996auto(l5) f1626pcompact
audia42.819996manual(m5)f1826pcompact
In [4]:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

Add content to your notebook

a. Creating headers

To create a header in Markdown, all we need to do is use the # sign.
One sign makes a heading 1, two signs make a heading 2, and so on.

Header 1

Header 2

Header 3

Header 4

b. Making lists

We can create bullet points in Markdown using single asterisks in place of bullets and select “Run”.

  • list1
  • lis2

c. Highlist text with only HTML code

Here is an example of highlighted text. using the tag

Linux

In [ ]:
# Not Run
1)
ls -al

2)
head 1kg_EU_qc.bim 

3)
head 1kg_EU_qc.bim | 
grep  rs1806509

R

In [ ]:
# Not Run
1)
system("
ls -la
")

2)
system("
head 1kg_EU_qc.bim | \\
grep  rs1806509
")

Jupyter Notebook

In [11]:
setwd("/scratch/silo1/BGA_LAB/workshop/dataset")  
cat(system(" ", intern=TRUE), sep='\n')

# 1)
cat(system(
"ls -la", 
intern=TRUE), sep='\n')

# 2)
cat(system(
"head 1kg_EU_qc.bim | \\
grep  rs1806509", 
intern=TRUE), sep='\n')

setwd("/scratch/silo1/BGA_LAB/workshop/analysis/")
total 974686
drwxrws--- 5 rasyed2 bga_lab         23 Jul 18 16:24 .
drwxrws--- 8 rasyed2 bga_lab         13 Jul 20 11:07 ..
-rw-r----- 1 rasyed2 bga_lab   80851178 Jun  6 15:21 1kg_EU_qc.bed
-rw-r----- 1 rasyed2 bga_lab   23709500 Jun  6 15:21 1kg_EU_qc.bim
-rw-r----- 1 rasyed2 bga_lab       7201 Jun  6 15:21 1kg_EU_qc.fam
-rw-rw---- 1 rasyed2 bga_lab        752 Jun 13 13:41 1kg_EU_qc.log
-rw-rw---- 1 rasyed2 bga_lab   20305240 Jun 13 13:41 1kg_EU_qc.map
-rw-rw---- 1 rasyed2 bga_lab 1290221741 Jun 13 13:41 1kg_EU_qc.ped
-rw-r----- 1 rasyed2 bga_lab   41559309 Jun  6 15:31 BMI_LDpred.txt
-rw-rw---- 1 rasyed2 bga_lab        326 Jun 13 15:04 BMI_maf.log
-rw-r----- 1 rasyed2 bga_lab      22803 Jun  6 15:21 BMI_pheno.txt
-rw-r----- 1 rasyed2 bga_lab         98 Jun 13 15:04 BMI_rs2412493_maf.frq
-rw-rw---- 1 rasyed2 bga_lab        747 Jun 13 15:04 BMI_rs2412493_maf.log
-rwxr----- 1 rasyed2 bga_lab   77057069 Jun  6 15:30 BMI.txt
-rw-r----- 1 rasyed2 bga_lab  129396011 Jun  6 15:39 EA2_results.txt.gz
-rw-rw---- 1 rasyed2 bga_lab      91254 Jul 18 11:53 eigenvectors_EU_QC.txt
-rw-r----- 1 rasyed2 bga_lab   46903225 Jun  6 15:35 Giant_Height2018.txt.gz
drwxrws--- 3 rasyed2 bga_lab          3 Jun  7 10:36 ld_scores
drwxrws--- 2 rasyed2 bga_lab          8 Jun 10 11:42 OtherCase
-rw-rw---- 1 rasyed2 bga_lab        415 Jun 22 12:16 plink.log
-rw-rw---- 1 rasyed2 bga_lab        238 Jul 18 11:53 pve_EU_QC.txt
drwxrws--- 4 rasyed2 bga_lab          4 Jun 16 11:35 Reference
-rw-rw---- 1 rasyed2 bga_lab          0 Jun 15 12:28 tmp.txt
1	rs1806509	0	853954	C	A


This tutorial is written in R, and these are the packages you will need to install/load

In [7]:
library(data.table)
library(ggplot2)
library(bigsnpr)
library(remotes)
#library(corrplot)
#library(forcats)
Loading required package: bigstatsr

Other programs you will need are:

  • Plink
  • FlashPCA
  • GCTA
  • PRSICE
  • LDsc

1. Association

Objectives

  • Dataset explorer
  • Performing Principal Component analysis in genetic dataset
  • Running linear regression analysis
  • Running a genome-wide association (GWAS) study in Plink

Background Information

  • GWAS: a method to test common genetic variants across the exomes/genomes of many individuals to identify genotype-phenotype associations.
  • It quantifies statistical association between genetic variation and phenotypes.
  • GWAS aims to find those genetic markers at which variation in genotype is significantly associated with variation in phenotype.

  • Phenotype/trait: the set of observable characteristics of an individual resulting from the interaction of its genotype with the environment. For example:
    • Quantitative traits like standing height or concentration of cholesterol particles in circulation
    • Binary traits like diagnoses of multiple sclerosis or schizophrenia.
In [2]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/Figure_plink_files.PNG")

PLINK website:
https://zzz.bwh.harvard.edu/plink/

PLINK2 website:
https://www.cog-genomics.org/plink/2.0/

PLINK Manual:
https://zzz.bwh.harvard.edu/plink/dist/plink-doc-1.07.pdf

  • PLINK is a free, open-source whole genome association analysis toolset
  • Designed to perform a range of basic, large-scale analyses in a computationally efficient manner
  • Command-line program that works on Windows, Mac and Linux


Source:
PLINK has many functions including: those related to

  • Data Organization,
  • Formatting,
  • Quality Control,
  • Association Testing,
  • Population stratification and much more.
In [3]:
cat(system(
"plink --maf --help", 
intern=TRUE), sep='\n')
PLINK v1.90b6.18 64-bit (16 Jun 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
--help present, ignoring other flags.

--maf [freq]        : Exclude variants with minor allele frequency lower than
                      a threshold (default 0.01).
--max-maf <freq>    : Exclude variants with MAF greater than the threshold.
--mac <ct>          : Exclude variants with minor allele count lower than the
  (alias: --min-ac)   given threshold.
--max-mac <ct>      : Exclude variants with minor allele count greater than
  (alias: --max-ac)   the given threshold.

Prepare basic information for GWAS and PGS

Target genome information: Information on the target to calculate PGS

Prepare independent GWAS analysis results for PGS analysis:

  • Obtained summary statistics data through public websites (dbGap) and checked for inclusion of required information.
  • Chromosome, position, allele information (effect, other allele), (beta), standard error, sample size, p-value.
  • The same genome build should be used for both panel and target genome information (hg 19).

Further reading and Resource

  • Marees, Andries T., et al. "A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis." International journal of methods in psychiatric research 27.2 (2018): e1608.
  • Anderson, Carl A., et al. "Data quality control in genetic case-control association studies." Nature protocols 5.9 (2010): 1564-1573.
  • Smoller, Jordan W. "The genetics of stress-related disorders: PTSD, depression, and anxiety disorders." Neuropsychopharmacology 41.1 (2016): 297-319.

Dataset

Gentotype: 1kg_EU_qc.bed, 1kg_EU_qc.bim and 1kg_EU_qc.fam
Phenotype: BMI_pheno.txt
Sumary statistics: BMI.txt, EA2_results.txt.gz and Giant_Height2018.txt.gz

Dataset Explorer

In [7]:
cat(system(
"wc -l /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bim", 
intern=TRUE), sep='\n')
851065 /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bim

In this example, we will consider a dataset with five individuals; each of them genotyped for four SNPs. The genotypes themselves are in numerical coding, 0 is reference homozygous, 1 is heterozygous and 2 is alternate homozygous.

In [5]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/genotype_count.PNG")
In [6]:
# # the function snp_readBed converts SNP alleles from A,C,G,T to numerical coding, 
# 0 is reference homozygous, 1 is heterozygous and 2 is alternate homozygous.
# snp_readBed("/scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bed")

obj.bigSNP <- snp_attach("/scratch/silo1/BGA_LAB/workshop/dataset/OtherCase/1kg_EU_qc_BMI.rds")
G <- obj.bigSNP$genotypes
G2 <- as.data.frame(G[1:379,])
colnames(G2) <- obj.bigSNP$map$marker.ID
tmp_dat <- cbind(obj.bigSNP$fam$affection,G2)
colnames(tmp_dat)[1] <- "BMI"
tmp_dat <- as.data.frame(tmp_dat)
#  There are 379 total number of samples and 851060 SNPs. 
dim(tmp_dat)
#tmp_dat[1:5,1:5]
  1. 379
  2. 851066
In [7]:
# Count each genotype code for rs2826862
table(tmp_dat$rs1048488)
#table(tmp_dat$rs1048488)
  0   1   2 
234 136   9 

Boxplot

In a boxplot, the box represents 50% of central data because it starts in the first quartile (25%) and ends in the third (75%). The line inside the box represents a median.

In [10]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/boxPlot.png")
In [11]:
boxplot(BMI ~ rs2826862, data = tmp_dat, xlab = "Number of Genotype",
   ylab = " ", main = " ")
In [8]:
help(boxplot)

Generate PCs

Covariate file and GWAS

  • GWAS results can be biased due to population stratification (PS) or admixture in which ancestry differences result in allele frequencies differences leading to spurious associations and familial relatedness (that occur because of data quality issues or statistical sampling) which can lead to false positives.

  • The most widely used method to correct bias due to PS is principal components analysis (PCA). Often, the ten PCs with the highest eigenvalues are included to adjust for PS.

  • Similarly, PCA is also the most common method to detect ancestral outliers, although other methods such as multidimensional scaling can equally be used

Calculate the PC

  • Generate the PCs using FlashPCA.
  • It is recommended to do LD pruning genotype dataset e.g removing the SNP correlation using 0.2 r^2 threshold before estimating the PC.
  • We can use LD pruning in our target dataset, or we can prune 1000 Genome dataset and overlap SNP with our target dataset.

LD-Pruning

A statistical procedure used to remove pair of correlated SNPs / Linked SNPs.

  • LD pruning selects only one random SNP from each LD block.
  • In each step the SNP with the higher MAF is kept in the dataset.
  • Independent SNPs = if their correlation (r^2) is inferior to 0.2 threshold.

Why perform LD-Pruning:

  1. Over -representation or double counting of causal variants can occur if LD regions are not removed.
  2. To select only independent SNPs.
  3. To get more robust results.

LD-Pruning with PLINK:

PLINK selects only one representative SNP from each LD block:

  • High MAF (MAF of at least 1%)
  • With no pairs remaining with r2 > 0.2
  • plink--file input--indep-pairwise 50 5 0.5--out tmp1
  • plink--file input--extract tmp1.prune.in--recode--out output

Linkage disequilibrium-based SNP pruning
--indep-pairwise 50 5 0.2

a) consider a window of 50 SNPs.
b) calculate LD between each pair of SNPs in the window.
c) remove one of a pair of SNPs if the LD is greater than 0.2.

In [22]:
## 1000 Genomes LD pruned data (50 10 .5) file

kg_bim <- fread("/projects/bga_lab/DATA_REPOSITORIES/IMP_PIPELINE_FILES/IMPUTE2_1KG_Reference_All_Biallelic_SNPS.bim")
dim(kg_bim) # 2050463       13
head(kg_bim)
  1. 2050463
  2. 6
A data.table: 6 × 6
V1V2V3V4V5V6
<int><chr><int><int><chr><chr>
1rs3107975 0 55326TC
1rs288630040526736CG
1rs6650104 0564477AG
1rs1988726 0564862TC
1rs7349153 0565490TC
1rs6421779 0566024GA
In [23]:
## Read in your target data .bim
data_bim <- fread("/scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bim")
head(data_bim)
print("Dimension")
dim(data_bim) # n variants

## Determine overlap between 1KG and target data
table(data_bim$V2 %in% kg_bim$V2 )


overlap <- data_bim$V2[data_bim$V2 %in% kg_bim$V2 ]
length(overlap) # 843424
fwrite(as.list(overlap), "data_updated_1kg_ref_snps.txt", sep = " ", quote = FALSE, col.names = FALSE)
A data.table: 6 × 6
V1V2V3V4V5V6
<int><chr><int><int><chr><chr>
1rs10484880760912CT
1rs31158500761147TC
1rs25190310793947GA
1rs49703830838555AC
1rs44756910846808TC
1rs18065090853954CA
[1] "Dimension"
  1. 851065
  2. 6
 FALSE   TRUE 
  7641 843424 
843424
In [24]:
##### Restrict datas to overlapping SNPs #####
## Trim down the data to match reference sample

cat(system(
"plink --bfile /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc \\
--extract data_updated_1kg_ref_snps.txt \\
--make-bed --out EU_qc_match_1kgRef ", 
intern=TRUE), sep='\n')
PLINK v1.90b6.18 64-bit (16 Jun 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to EU_qc_match_1kgRef.log.
Options in effect:
  --bfile /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc
  --extract data_updated_1kg_ref_snps.txt
  --make-bed
  --out EU_qc_match_1kgRef

1031971 MB RAM detected; reserving 515985 MB for main workspace.
Allocated 217680 MB successfully, after larger attempt(s) failed.
851065 variants loaded from .bim file.
379 people (178 males, 201 females) loaded from .fam.
--extract: 843424 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 379 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
843424 variants and 379 people pass filters and QC.
Note: No phenotypes present.
--make-bed to EU_qc_match_1kgRef.bed + EU_qc_match_1kgRef.bim +
EU_qc_match_1kgRef.fam ... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.
In [25]:
# it takes about 5 minutes to run 
cat(system(
"flashpca --bfile EU_qc_match_1kgRef --ndim 20 --suffix _EU_QC.txt --numthreads 100", 
intern=TRUE), sep='\n')

# Output
#pcs_EU_QC.txt # eigen vector
#pve_EU_QC.txt # eigen values
[Fri Jul 22 10:51:57 2022] arguments: flashpca flashpca --bfile EU_qc_match_1kgRef --ndim 20 --suffix _EU_QC.txt --numthreads 100 
[Fri Jul 22 10:51:57 2022] Start flashpca (version 2.1)
[Fri Jul 22 10:51:58 2022] blocksize: 602926 (1828071632 bytes per block)
[Fri Jul 22 10:51:58 2022] PCA begin
[Fri Jul 22 10:56:00 2022] PCA done
[Fri Jul 22 10:56:00 2022] Writing 20 eigenvalues to file eigenvalues_EU_QC.txt
[Fri Jul 22 10:56:00 2022] Writing 20 eigenvectors to file eigenvectors_EU_QC.txt
[Fri Jul 22 10:56:00 2022] Writing 20 PCs to file pcs_EU_QC.txt
[Fri Jul 22 10:56:00 2022] Writing 20 proportion variance explained to file pve_EU_QC.txt
[Fri Jul 22 10:56:00 2022] Goodbye!
In [27]:
cat(system(
"head /scratch/silo1/BGA_LAB/workshop/analysis/pve_EU_QC.txt", 
intern=TRUE), sep='\n')
0.008623494
0.004458956
0.003479823
0.003245354
0.003155571
0.003139071
0.003118007
0.003108016
0.003101836
0.00308807
In [9]:
cat(system(
"head /scratch/silo1/BGA_LAB/workshop/analysis/pcs_EU_QC.txt | head ", 
intern=TRUE), sep='\n')
FID	IID	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	PC9	PC10	PC11	PC12	PC13	PC14	PC15	PC16	PC17	PC18	PC19	PC20
0	HG00096	0.01242365	-0.07595452	-0.01680644	0.0881174	0.0385509	-0.02607342	0.01687276	0.05083526	-0.06523377	-0.05294566	-0.04470348	-0.03587981	0.04837684	-0.0368848	-0.04937979	0.01630842	-0.003747645	-0.0282603	-0.05197128	0.02574301
0	HG00097	0.02261263	-0.09322247	-0.03750206	0.01745471	0.05698749	-0.05271887	0.001986322	0.08794831	-0.07104735	-0.1134111	0.03507826	0.02443423	-0.05342509	0.03147125	-0.07663653	-0.02306111	-0.04137738	0.000784083	-0.05740029	0.08012858
0	HG00099	0.006746184	-0.1040287	0.01814878	0.03193105	0.06067711	0.06962964	-0.03871903	0.0399715	0.01595159	-0.03458293	-0.02756342	-0.0201608	0.040726	-0.03949256	-0.1170923	0.04838187	0.0005748896	0.04083124	0.04767548	-0.08050239
0	HG00100	0.02456913	-0.1059152	-0.05258633	0.1284947	0.02699942	-0.01545383	0.03142333	-0.0009763567	-0.03597306	-0.01779976	0.0242158	0.05691472	0.1524257	-0.06823687	-0.07057563	0.00645599	-0.01892005	0.04042762	0.06381889	-0.08849938
0	HG00101	0.02085092	-0.09876345	-0.01970046	0.07797953	0.10389	0.005038879	0.056389	0.008460883	-0.07637942	-0.03571031	0.04132938	0.07350528	0.06672978	-0.01018131	-0.01531458	0.006411239	0.05957613	0.0324557	0.04336405	-0.07359367
0	HG00102	0.004682484	-0.09735554	-0.01354115	0.1285232	-0.007667986	-0.008087716	0.07357861	0.02897895	0.022137	-0.007800477	0.04960406	-0.03809981	0.06307478	-0.01263359	0.03897042	0.03054337	-0.088698	-0.004360761	-0.08808678	-0.004914086
0	HG00103	0.02010171	-0.0758244	-0.009842469	0.009010794	-0.08731564	0.05098554	0.03871191	-0.07211523	0.09918271	-0.01807769	-0.008292442	0.08721822	0.006815058	0.03803679	0.04125575	0.04074155	-0.08007255	0.01570215	-0.005974291	0.03575748
0	HG00104	0.01548017	-0.07064368	0.00277497	0.0186751	-0.004887796	-0.01383607	0.0558695	-0.03439123	-0.03908065	-0.01390377	0.02845204	0.05312429	0.003229626	0.1239419	0.03477645	-0.0388	0.007594813	0.07869259	0.03441025	-0.00448954
0	HG00106	0.01392874	-0.08604257	0.02806279	0.03426896	-0.04331729	0.0100231	0.002253311	-0.02165969	0.03656299	0.0005815769	-0.03136674	0.07104842	-0.03072697	0.1461936	-0.07704279	0.005830419	-0.05032821	0.02117162	0.02163948	-0.00717844

What is the eigen-value and eigen-vector?

  • Principal Component Analysis (PCA) is a method of reducing complex high dimensional data to few dimensions by ignoring the least important dimensions.
  • PCA compresses a very large number of data points down to smaller data points.
  • The new variables (V) are derived from the linear combination of existed variables (x).
  • The number of these principal components is always equal or less than the number of data originals variables, from which they are derived.
  • PCs (eigen-vector) are the new axes, and they provide information on the variation present in the data.
  • Among all the PCs, PC1 covers the most variation in the data and it is, therefore, the most important principal component. PC2 covers the second most variation in the data.
  • PCA also allows us to cluster the resulting pattern of area (population group) shown by data points.
In [29]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/pc.png")

How many PCs to select?

In [30]:
# Data Visualization: Scree Plot
eigen_value <- fread("pve_EU_QC.txt")
seq_PC <- seq(1,20)
eigen_value$PC <- as.factor(paste("PC",seq_PC,sep = ""))
eigen_value <- as.data.frame(eigen_value)
head(eigen_value)

ggplot(eigen_value, aes(x = fct_inorder(PC),y = V1)) + 
  geom_point()
A data.frame: 6 × 2
V1PC
<dbl><fct>
10.008623494PC1
20.004458956PC2
30.003479823PC3
40.003245354PC4
50.003155571PC5
60.003139071PC6
In [31]:
pc <- fread("pcs_EU_QC.txt")
head(pc)
A data.table: 6 × 22
FIDIIDPC1PC2PC3PC4PC5PC6PC7PC8PC11PC12PC13PC14PC15PC16PC17PC18PC19PC20
<int><chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
0HG000960.012423650-0.07595452-0.016806440.08811740 0.038550900-0.026073420 0.016872760 0.0508352600-0.04470348-0.03587981 0.04837684-0.03688480-0.04937979 0.016308420-0.0037476450-0.028260300-0.05197128 0.025743010
0HG000970.022612630-0.09322247-0.037502060.01745471 0.056987490-0.052718870 0.001986322 0.0879483100 0.03507826 0.02443423-0.05342509 0.03147125-0.07663653-0.023061110-0.0413773800 0.000784083-0.05740029 0.080128580
0HG000990.006746184-0.10402870 0.018148780.03193105 0.060677110 0.069629640-0.038719030 0.0399715000-0.02756342-0.02016080 0.04072600-0.03949256-0.11709230 0.048381870 0.0005748896 0.040831240 0.04767548-0.080502390
0HG001000.024569130-0.10591520-0.052586330.12849470 0.026999420-0.015453830 0.031423330-0.0009763567 0.02421580 0.05691472 0.15242570-0.06823687-0.07057563 0.006455990-0.0189200500 0.040427620 0.06381889-0.088499380
0HG001010.020850920-0.09876345-0.019700460.07797953 0.103890000 0.005038879 0.056389000 0.0084608830 0.04132938 0.07350528 0.06672978-0.01018131-0.01531458 0.006411239 0.0595761300 0.032455700 0.04336405-0.073593670
0HG001020.004682484-0.09735554-0.013541150.12852320-0.007667986-0.008087716 0.073578610 0.0289789500 0.04960406-0.03809981 0.06307478-0.01263359 0.03897042 0.030543370-0.0886980000-0.004360761-0.08808678-0.004914086
In [32]:
# phenotype 
cat(system("head /scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt", intern=TRUE),sep='\n')
FID	IID	BMI
0	HG00096	25.022827	
0	HG00097	24.853638	
0	HG00099	23.689295	
0	HG00100	27.016203	
0	HG00101	21.461624	
0	HG00102	20.673635	
0	HG00103	25.71508	
0	HG00104	25.252243	
0	HG00106	22.765049	
In [33]:
pheno <- fread("/scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt")
pheno <- pheno[,c(2,3)]
colnames(pheno) <- c("IID","BMI")
head(pheno)
dim(pheno)
Warning message in fread("/scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt"):
“Detected 3 column names but the data has 4 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.”
A data.table: 6 × 2
IIDBMI
<chr><dbl>
HG0009625.02283
HG0009724.85364
HG0009923.68930
HG0010027.01620
HG0010121.46162
HG0010220.67364
  1. 1092
  2. 2
In [34]:
pheno_pc <- merge(pheno,pc, by = "IID")
head(pheno_pc)
A data.table: 6 × 23
IIDBMIFIDPC1PC2PC3PC4PC5PC6PC7PC11PC12PC13PC14PC15PC16PC17PC18PC19PC20
<chr><dbl><int><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
HG0009625.0228300.012423650-0.07595452-0.016806440.08811740 0.038550900-0.026073420 0.016872760-0.04470348-0.03587981 0.04837684-0.03688480-0.04937979 0.016308420-0.0037476450-0.028260300-0.05197128 0.025743010
HG0009724.8536400.022612630-0.09322247-0.037502060.01745471 0.056987490-0.052718870 0.001986322 0.03507826 0.02443423-0.05342509 0.03147125-0.07663653-0.023061110-0.0413773800 0.000784083-0.05740029 0.080128580
HG0009923.6893000.006746184-0.10402870 0.018148780.03193105 0.060677110 0.069629640-0.038719030-0.02756342-0.02016080 0.04072600-0.03949256-0.11709230 0.048381870 0.0005748896 0.040831240 0.04767548-0.080502390
HG0010027.0162000.024569130-0.10591520-0.052586330.12849470 0.026999420-0.015453830 0.031423330 0.02421580 0.05691472 0.15242570-0.06823687-0.07057563 0.006455990-0.0189200500 0.040427620 0.06381889-0.088499380
HG0010121.4616200.020850920-0.09876345-0.019700460.07797953 0.103890000 0.005038879 0.056389000 0.04132938 0.07350528 0.06672978-0.01018131-0.01531458 0.006411239 0.0595761300 0.032455700 0.04336405-0.073593670
HG0010220.6736400.004682484-0.09735554-0.013541150.12852320-0.007667986-0.008087716 0.073578610 0.04960406-0.03809981 0.06307478-0.01263359 0.03897042 0.030543370-0.0886980000-0.004360761-0.08808678-0.004914086

# of PCs (Method 2)

In [35]:
# PC1s
pc_list <- NULL

PC1_linear <- lm(BMI ~ PC1, data = pheno_pc,na.action=na.exclude)
pc_list[1] <- summary(PC1_linear)$r.squared
#  

PC2_linear <- lm(BMI ~ PC1 + PC2, data = pheno_pc,na.action=na.exclude)
pc_list[2] <- summary(PC2_linear)$r.squared
# 

PC3_linear <- lm(BMI ~ PC1 + PC2 + PC3, data = pheno_pc,na.action=na.exclude)
pc_list[3] <- summary(PC3_linear)$r.squared
# 

PC4_linear <- lm(BMI ~ PC1 + PC2 + PC3 + PC4, data = pheno_pc,na.action=na.exclude)
pc_list[4] <- summary(PC4_linear)$r.squared


PC5_linear <- lm(BMI ~ PC1 + PC2 + PC3 + PC4 + PC5 , data = pheno_pc,na.action=na.exclude)
pc_list[5] <- summary(PC5_linear)$r.squared


PC6_linear <- lm(BMI ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 , data = pheno_pc,na.action=na.exclude)
pc_list[6] <- summary(PC6_linear)$r.squared


PC7_linear <- lm(BMI ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 , data = pheno_pc,na.action=na.exclude)
pc_list[7] <- summary(PC7_linear)$r.squared


PC8_linear <- lm(BMI ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 , data = pheno_pc,na.action=na.exclude)
pc_list[8] <- summary(PC8_linear)$r.squared


PC9_linear <- lm(BMI ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 , data = pheno_pc,na.action=na.exclude)
pc_list[9] <- summary(PC8_linear)$r.squared


PC10_linear <- lm(BMI ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 , data = pheno_pc,na.action=na.exclude)
pc_list[10] <- summary(PC8_linear)$r.squared


PC11_linear <- lm(BMI ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 + PC11 , data = pheno_pc,na.action=na.exclude)
pc_list[11] <- summary(PC8_linear)$r.squared


PC12_linear <- lm(BMI ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 + PC11+ PC12 , data = pheno_pc,na.action=na.exclude)
pc_list[12] <- summary(PC8_linear)$r.squared
In [36]:
PC_R2 <- as.data.frame(pc_list)
seq_PC <- seq(1,12)
PC_R2$PC <- fct_inorder(paste("PC",seq_PC,sep = ""))
PC_R2 <- as.data.frame(PC_R2)
colnames(PC_R2)[1] <- "R_2"
head(PC_R2)
A data.frame: 6 × 2
R_2PC
<dbl><fct>
12.845031e-05PC1
22.474114e-04PC2
31.470535e-03PC3
45.268411e-03PC4
55.404808e-03PC5
65.657399e-03PC6
In [37]:
ggplot(PC_R2, aes(x = PC,y = R_2)) + 
  geom_point()

PCA plot

In [38]:
thousand_index <- fread("/projects/bga_lab/DATA_REPOSITORIES/1000Genome/1000Genome_labels.csv")
head(thousand_index); dim(thousand_index)
A data.table: 6 × 4
IDPOPGROUPSEX
<chr><chr><chr><chr>
HG00096GBREURmale
HG00097GBREURfemale
HG00099GBREURfemale
HG00100GBREURfemale
HG00101GBREURmale
HG00102GBREURfemale
  1. 2504
  2. 4
In [39]:
thous_pc <- merge(thousand_index,pc,by.x = "ID", by.y = "IID", sort = F)
head(thous_pc); dim(thous_pc)
A data.table: 6 × 25
IDPOPGROUPSEXFIDPC1PC2PC3PC4PC5PC11PC12PC13PC14PC15PC16PC17PC18PC19PC20
<chr><chr><chr><chr><int><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
HG00096GBREURmale 00.012423650-0.07595452-0.016806440.08811740 0.038550900-0.04470348-0.03587981 0.04837684-0.03688480-0.04937979 0.016308420-0.0037476450-0.028260300-0.05197128 0.025743010
HG00097GBREURfemale00.022612630-0.09322247-0.037502060.01745471 0.056987490 0.03507826 0.02443423-0.05342509 0.03147125-0.07663653-0.023061110-0.0413773800 0.000784083-0.05740029 0.080128580
HG00099GBREURfemale00.006746184-0.10402870 0.018148780.03193105 0.060677110-0.02756342-0.02016080 0.04072600-0.03949256-0.11709230 0.048381870 0.0005748896 0.040831240 0.04767548-0.080502390
HG00100GBREURfemale00.024569130-0.10591520-0.052586330.12849470 0.026999420 0.02421580 0.05691472 0.15242570-0.06823687-0.07057563 0.006455990-0.0189200500 0.040427620 0.06381889-0.088499380
HG00101GBREURmale 00.020850920-0.09876345-0.019700460.07797953 0.103890000 0.04132938 0.07350528 0.06672978-0.01018131-0.01531458 0.006411239 0.0595761300 0.032455700 0.04336405-0.073593670
HG00102GBREURfemale00.004682484-0.09735554-0.013541150.12852320-0.007667986 0.04960406-0.03809981 0.06307478-0.01263359 0.03897042 0.030543370-0.0886980000-0.004360761-0.08808678-0.004914086
  1. 364
  2. 25
In [40]:
table(thous_pc$POP, useNA = "ifany")
CEU FIN GBR IBS TSI 
 84  89  81  14  96 
In [41]:
ggplot(thous_pc, aes(x=PC1, y=PC2, color=POP)) +
  geom_point(size=1.8) + guides(color = guide_legend(override.aes = list(size = 10))) + 
  theme_classic()
#ggsave("data_pc2vpc1.png", width = 5, height = 5, units = "in")

Relateness

Identity-by-state (IBD)

  • Over M markers, the IBS between the ith and jth individuals is given by

    $${IBS}_{ij} = 1 - \frac{1}{2M} \sum \limits _{k} \lvert{{G}_{ik} - {G}_{jk}}\lvert $$
    where $ {G}_{ik} $ denotes the number of minor alleles (0,1,2) carried by the ith individuals at SNP k
  • Identicals samples will share IBS near to 100%
  • Relative individuals will share higher IBS than unrelated individuals
In [42]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/relative.png")
  • For the relativeness between individuals, plink can give false positive associations where mixed linear model can correct the association analysis.
  • Try re-running the analysis using MLMA or FaST-LMM, which estimates and accounts for the relatedness between individuals.
  • If you want to use plink assoc linear/logistic, it is recommended to remove all related samples. However, if we want to keep relatives, GCTA-GREML is advised.
  • MLMA-LOCO analyses are powerful approaches to assess DNA associations with human traits (fixed effect) and assume a linear model while adjusting for population structure by estimating genomic relatedness matrices (random effect).
In [43]:
cat(system(
"plink \\
--bfile EU_qc_match_1kgRef \\
--genome \\
--out EU_IBD", 
intern=TRUE), sep='\n')
PLINK v1.90b6.18 64-bit (16 Jun 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to EU_IBD.log.
Options in effect:
  --bfile EU_qc_match_1kgRef
  --genome
  --out EU_IBD

1031971 MB RAM detected; reserving 515985 MB for main workspace.
Allocated 217680 MB successfully, after larger attempt(s) failed.
843424 variants loaded from .bim file.
379 people (178 males, 201 females) loaded from .fam.
Using up to 159 threads (change this with --threads).
Before main variant filters, 379 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
843424 variants and 379 people pass filters and QC.
Note: No phenotypes present.
IBD calculations complete.  
Finished writing EU_IBD.genome .
In [44]:
IBD_data <- read.table("EU_IBD.genome", header=T)
head(IBD_data)
A data.frame: 6 × 14
FID1IID1FID2IID2RTEZZ0Z1Z2PI_HATPHEDSTPPCRATIO
<int><chr><int><chr><chr><int><dbl><dbl><dbl><dbl><int><dbl><dbl><dbl>
10HG000960HG00097OT00.97060.029400.0147-10.7464140.59092.0145
20HG000960HG00099OT00.97100.029000.0145-10.7485800.87922.0753
30HG000960HG00100OT00.97230.027700.0139-10.7468870.61132.0179
40HG000960HG00101OT00.95720.042800.0214-10.7471260.42551.9883
50HG000960HG00102OT00.97580.024200.0121-10.7480110.68832.0311
60HG000960HG00103OT00.99120.008800.0044-10.7461540.39321.9831
In [ ]:
#examine file
#http://zzz.bwh.harvard.edu/plink/ibdibs.shtml

#FID1      Family ID for first individual
#IID1      Individual ID for first individual
#FID2      Family ID for second individual
#IID2      Individual ID for second individual
#RT        Relationship type given PED file
#EZ        Expected IBD sharing given PED file
#Z0        P(IBD=0)
#Z1        P(IBD=1)
#Z2        P(IBD=2)
#PI_HAT    P(IBD=2)+0.5*P(IBD=1) ( proportion IBD )
#PHE       Pairwise phenotypic code (1,0,-1 = AA, AU and UU pairs)
#DST       IBS distance (IBS2 + 0.5*IBS1) / ( N SNP pairs )
#PPC       IBS binomial test
#RATIO     Of HETHET : IBS 0 SNPs (expected value is 2)
In [45]:
ggplot(IBD_data,aes(x=Z0, y=Z1, colour=PI_HAT), fill = PI_HAT) +
ggtitle("Z0 by Z1")+coord_cartesian(xlim=c(-0.05,1),ylim=c(-0.05,1))+
geom_point(size=4) +scale_colour_gradient(limits=c(0, 1), low="blue",high="red")

Single Variant

With the LM, the variability in one variable is explained by the change in one or more other variables. For example BMI is the response variable and genotype rsID is the explanatory variable. The model explain the connection between the response and the explanatory variables.

In [46]:
boxplot(BMI~ rs2412493, data = tmp_dat, xlab = "Number of Genotype",
   ylab = " ", main = " ")
abline(lm(BMI ~ rs2412493, data = tmp_dat))
In [47]:
summary(lm(BMI ~ rs2412493, data = tmp_dat))
Call:
lm(formula = BMI ~ rs2412493, data = tmp_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9807 -1.9920  0.2165  1.9415  9.8083 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24.5659     0.1657 148.281  < 2e-16 ***
rs2412493    -1.5716     0.3163  -4.969 1.02e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.858 on 377 degrees of freedom
Multiple R-squared:  0.06147,	Adjusted R-squared:  0.05898 
F-statistic: 24.69 on 1 and 377 DF,  p-value: 1.022e-06
In [48]:
table(tmp_dat$rs2412493)
  0   1   2 
293  80   6 
In [49]:
which(data_bim$V2 == "rs2412493")
#213875*2

#awk -F " " '{print $427755,$427756}' /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.ped | sort | uniq -c

#    293 A A     0
#     80 G A     1
#      6 G G     2
213875

Interpretation

We see that the phenotype varies with genotype in such a way A increase (and G decrease) the level of BMI.

Next step, we are going to test significance of coefficient at 5% and explain the meaning of the statistical significance of the coeffiecent

Hypothesis test for coefficients:

$${H}_{0} : {\beta = 0} \:\:VS \:\:{H}_{1} : {\beta \neq 0} \:\:\:p-value = 1.02e-06 $$
We reject ${H}_{0}$
There is enough evidence to suggest that sample who have rsid is negatively associated with BMI.

In [50]:
cat(system(
"plink \\
--bfile /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc \\
--snp rs2412493 \\
--linear --assoc \\
--pheno /scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt \\
--out BMI_rs2412493 ", 
intern=TRUE), sep='\n')
PLINK v1.90b6.18 64-bit (16 Jun 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to BMI_rs2412493.log.
Options in effect:
  --assoc
  --bfile /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc
  --linear
  --out BMI_rs2412493
  --pheno /scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt
  --snp rs2412493

1031971 MB RAM detected; reserving 515985 MB for main workspace.
Allocated 217680 MB successfully, after larger attempt(s) failed.
1 out of 851065 variants loaded from .bim file.
379 people (178 males, 201 females) loaded from .fam.
379 phenotype values present after --pheno.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 379 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
1 variant and 379 people pass filters and QC.
Phenotype data is quantitative.
Writing QT --assoc report to BMI_rs2412493.qassoc ... 0%done.
Writing linear model association results to BMI_rs2412493.assoc.linear ...
0%done.
In [51]:
cat(system(
"head BMI_rs2412493.assoc.linear", 
intern=TRUE), sep='\n')
 CHR         SNP         BP   A1       TEST    NMISS       BETA         STAT            P 
   4   rs2412493   54638699    G        ADD      379     -1.572       -4.969    1.022e-06
In [ ]:
Call:
lm(formula = BMI ~ rs2412493, data = tmp_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9807 -1.9920  0.2165  1.9415  9.8083 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24.5659     0.1657 148.281  < 2e-16 ***
rs2412493    -1.5716     0.3163  -4.969 1.02e-06 ***
---
Signif. codes:  0***0.001**0.01*0.05.’ 0.1 ‘ ’ 1

Residual standard error: 2.858 on 377 degrees of freedom
Multiple R-squared:  0.06147,	Adjusted R-squared:  0.05898 
F-statistic: 24.69 on 1 and 377 DF,  p-value: 1.022e-06

All Variants

In [54]:
# it takes about 5 minutes to run 
cat(system(
"plink --bfile /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc \\
--pheno /scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt \\
--assoc --linear \\
--hide-covar --covar pcs_EU_QC.txt \\
--covar-name PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \\
--out BMI_gwas_plink", 
intern=TRUE), sep='\n')
PLINK v1.90b6.18 64-bit (16 Jun 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to BMI_gwas_plink.log.
Options in effect:
  --assoc
  --bfile /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc
  --covar pcs_EU_QC.txt
  --covar-name PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
  --hide-covar
  --linear
  --out BMI_gwas_plink
  --pheno /scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt

Note: --hide-covar flag deprecated.  Use e.g. "--linear hide-covar".
1031971 MB RAM detected; reserving 515985 MB for main workspace.
Allocated 217680 MB successfully, after larger attempt(s) failed.
851065 variants loaded from .bim file.
379 people (178 males, 201 females) loaded from .fam.
379 phenotype values present after --pheno.
Using 1 thread (no multithreaded calculations invoked).
--covar: 10 out of 20 covariates loaded.
Before main variant filters, 379 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
851065 variants and 379 people pass filters and QC.
Phenotype data is quantitative.
Writing QT --assoc report to BMI_gwas_plink.qassoc ... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.
Writing linear model association results to BMI_gwas_plink.assoc.linear ...
0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.
In [55]:
cat(system(
"head BMI_gwas_plink.assoc.linear", 
intern=TRUE), sep='\n')
 CHR         SNP         BP   A1       TEST    NMISS       BETA         STAT            P 
   1   rs1048488     760912    C        ADD      379     0.6842        2.377      0.01797
   1   rs3115850     761147    T        ADD      379     0.7046        2.408      0.01654
   1   rs2519031     793947    G        ADD      379    -0.7763      -0.8297       0.4072
   1   rs4970383     838555    A        ADD      379   -0.02005     -0.07888       0.9372
   1   rs4475691     846808    T        ADD      379    -0.3237       -1.155       0.2489
   1   rs1806509     853954    C        ADD      379   -0.04673      -0.2157       0.8293
   1   rs7537756     854250    G        ADD      379    -0.1025      -0.3722         0.71
   1  rs28576697     870645    C        ADD      379     0.1975       0.8407       0.4011
   1   rs7523549     879317    T        ADD      379    0.08359       0.1417       0.8874
In [13]:
assoc <- fread("BMI_gwas_plink.assoc.linear")
assoc <- assoc[order(assoc$P),]
head(assoc)
A data.table: 6 × 9
CHRSNPBPA1TESTNMISSBETASTATP
<int><chr><int><chr><chr><int><dbl><dbl><dbl>
4rs2412493 54638699GADD379-1.672-5.1135.120e-07
4rs2590767 54618574TADD379-1.633-4.9601.081e-06
3rs6775393188504998TADD379 1.074 4.8781.597e-06
6rs4333386169415956CADD379-1.189-4.7253.280e-06
3rs6778476188544680GADD379 1.009 4.6364.934e-06
6rs655307 169427629AADD379-1.231-4.6105.562e-06

Correcting for multiple testing in a GWAS

  • In GWAS, a statistical test is performed for each genetic locus and the phenotype to produce a test statistics and associated p-value.
  • FALSE POSITIVE: If we took the standard p-value of 0.05 (i.e., 5%), even if a given genetic variants was not associated with our phenotype, there is a 1 in 20 chance that we would find a significant association. This is what is referred to as false positive.
  • MULTIPLE TESTING: Because association studies can include a high number of tests (here, we have only tested 1 SNPs but in a genome-wide association study we usually test about 1 million SNPs), we should always adjust our significance thresholds for the number of tests performed.
  • BONFERRONI CORRECTION: One way of adjusting for the number of tests completed is to divide your significance threshold by the number of tests performed. This is known as a Bonferroni correction.
  • we divide the chosen significance threshold (p-value) by the number of tests that are performed. If 1 million tests were performed, the Bonferroni corrected p-value of significance is p < 5 X10^-8. P < 5X10-6 is the standard for genome-wide statistically significant and suggestive hits.

Interpret GWAS results

Manhattan Plot

  • A scatter plot used to display large number of data points (SNPs) and their significance in GWAS
  • Each data point is a SNP (not gene)
  • X-axis: genomic position
  • Y-axis: negative logarithm of the association p-value
In [60]:
library(qqman)
manhattan(assoc, chr="CHR", bp="BP", snp="SNP", p="P" , genomewideline = FALSE, suggestiveline = FALSE )
In [4]:
help(package="qqman")
In [10]:
help(manhattan,package="qqman")
In [61]:
qq(assoc$P)
In [62]:
median(qchisq(assoc$P, df=1, lower.tail=FALSE)) / qchisq(0.5, 1)
# Genomic inflation est. lambda (based on median chisq) = 1.
1

A value close to 1 suggest the data have been properly adjusted for the population substructure. If lambda > 1.2, this suggest the presence of stratification.

Exercise Question:

We examined European ancestry population.
From the plink website, try to find Chinese and Japanese sample (CHB and JPT) and perform:

  1. Association analysis using the linear regression using PLINK

2. Gene-based / Gene set analysis

Background Information

In [50]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/gene.png")

Gene-based analysis

Genetic marker data is aggregated to the level of whole genes and tests for the joint association of all markers/SNPs in the gene with the phenotype.
Instead of testing single SNPs, we test for joint association effect of all SNPs in a gene.

Gene-set analysis:

Individual genes are aggregated to groups of genes sharing certain biological, functional or other characteristics.

Advantages:

  1. Reduces the number of tests
  2. makes it possible to detect multiple weaker associations
  3. Provide insight into the involvement of specific biological pathways or cell functions

Types:

  1. self-contained analysis: tests whether the gene set contains any association at all
  2. competitive analysis: tests whether the association in the gene set is greater than in other genes

MAGMA

  • A novel tool for gene-based and gene-set analysis.
  • Aggregates SNP associations to nearest genes.
  • Allows flexibility to independently change and expand both the gene and the gene-set analysis.

H-MAGMA (Hi-C-coupled MAGMA)

  • It Advances Magma as it uses chromatin interaction profile to assign intergenic and non-coding SNPs with their related genes.
  • Analyzes gene regulatory relationships in the disease-relevant tissue and identifies relevant target genes, whereas MAGMA does not take into account tissue-specific regulatory relationships.
  • H-MAGMA assign noncoding SNPs to their genes based on long-range interactions in disease-relevant tissues measured by Hi-C whereas MAGMA assigns SNPs to the nearest genes.

Running a gene-based analysis in MAGMA

  • Three main steps
    1. Annotation: map SNPs onto genes
    2. Gene analysis: compute association of genes with phenotype
    3. Gene-set analysis: test gene associations in gene sets

1. Annotation

  • Map SNPs to a gene based on physical location
    • If located inside the transcription region of the gene
    • Optionally, if located in window around the gene
      • Especially upstream of transcription start site
    • A SNP can be mapped to multiple genes
In [40]:
#### Step 1: annotation ####
# In this step we will annotate the SNPs in the data to genes. To do so, use the command:

assoc <- fread("BMI_gwas_plink.assoc.linear")
assoc <- assoc[order(assoc$P),]
head(assoc)

SNPLOC <- fread("BMI_gwas_plink.assoc.linear")
SNPLOC <- SNPLOC[order(SNPLOC$P),]
#SNPLOC <- SNPLOC[1:1084954,c(2,1,3)]
SNPLOC <- SNPLOC[,c(2,1,3)]
head(SNPLOC); dim(SNPLOC)
write.table(SNPLOC,"SNPLOC.txt", row.names = FALSE, quote = FALSE, col.names = FALSE, sep = "\t", na = "NA")
A data.table: 6 × 9
CHRSNPBPA1TESTNMISSBETASTATP
<int><chr><int><chr><chr><int><dbl><dbl><dbl>
4rs2412493 54638699GADD379-1.672-5.1135.120e-07
4rs2590767 54618574TADD379-1.633-4.9601.081e-06
3rs6775393188504998TADD379 1.074 4.8781.597e-06
6rs4333386169415956CADD379-1.189-4.7253.280e-06
3rs6778476188544680GADD379 1.009 4.6364.934e-06
6rs655307 169427629AADD379-1.231-4.6105.562e-06
A data.table: 6 × 3
SNPCHRBP
<chr><int><int>
rs24124934 54638699
rs25907674 54618574
rs67753933188504998
rs43333866169415956
rs67784763188544680
rs655307 6169427629
  1. 851065
  2. 3
In [8]:
cat(system(
"/home/rasyed2/Tool/magma --annotate \\
--snp-loc SNPLOC.txt \\
--gene-loc /projects/bga_lab/DATA_REPOSITORIES/IMP_PIPELINE_FILES/Magma/NCBI37.3.gene.loc \\
--out ANNOT", 
intern=TRUE), sep='\n')
Welcome to MAGMA v1.10 (linux)
Using flags:
	--annotate
	--snp-loc SNPLOC.txt
	--gene-loc /projects/bga_lab/DATA_REPOSITORIES/IMP_PIPELINE_FILES/Magma/NCBI37.3.gene.loc
	--out ANNOT

Start time is 13:25:24, Monday 01 Aug 2022

Starting annotation...
Reading gene locations from file /projects/bga_lab/DATA_REPOSITORIES/IMP_PIPELINE_FILES/Magma/NCBI37.3.gene.loc... 
	19427 gene locations read from file
	chromosome  1: 2016 genes
	chromosome  2: 1226 genes
	chromosome  3: 1050 genes
	chromosome  4: 745 genes
	chromosome  5: 856 genes
	chromosome  6: 1016 genes
	chromosome  7: 906 genes
	chromosome  8: 669 genes
	chromosome  9: 775 genes
	chromosome 10: 723 genes
	chromosome 11: 1275 genes
	chromosome 12: 1009 genes
	chromosome 13: 320 genes
	chromosome 14: 595 genes
	chromosome 15: 586 genes
	chromosome 16: 817 genes
	chromosome 17: 1147 genes
	chromosome 18: 271 genes
	chromosome 19: 1389 genes
	chromosome 20: 527 genes
	chromosome 21: 215 genes
	chromosome 22: 442 genes
	chromosome  X: 805 genes
	chromosome  Y: 47 genes
Reading SNP locations from file SNPLOC.txt... 
	851065 SNP locations read from file
	of those, 381802 (44.86%) mapped to at least one gene
Writing annotation to file ANNOT.genes.annot
	for chromosome  1, 285 genes are empty (out of 2016)
	for chromosome  2, 121 genes are empty (out of 1226)
	for chromosome  3, 70 genes are empty (out of 1050)
	for chromosome  4, 72 genes are empty (out of 745)
	for chromosome  5, 67 genes are empty (out of 856)
	for chromosome  6, 204 genes are empty (out of 1016)
	for chromosome  7, 108 genes are empty (out of 906)
	for chromosome  8, 93 genes are empty (out of 669)
	for chromosome  9, 111 genes are empty (out of 775)
	for chromosome 10, 63 genes are empty (out of 723)
	for chromosome 11, 186 genes are empty (out of 1275)
	for chromosome 12, 90 genes are empty (out of 1009)
	for chromosome 13, 15 genes are empty (out of 320)
	for chromosome 14, 56 genes are empty (out of 595)
	for chromosome 15, 80 genes are empty (out of 586)
	for chromosome 16, 135 genes are empty (out of 817)
	for chromosome 17, 181 genes are empty (out of 1147)
	for chromosome 18, 12 genes are empty (out of 271)
	for chromosome 19, 159 genes are empty (out of 1389)
	for chromosome 20, 42 genes are empty (out of 527)
	for chromosome 21, 43 genes are empty (out of 215)
	for chromosome 22, 58 genes are empty (out of 442)
	for chromosome  X, 805 genes are empty (out of 805)
	for chromosome  Y, 47 genes are empty (out of 47)
	at least one SNP mapped to each of a total of 16324 genes (out of 19427)


End time is 13:25:26, Monday 01 Aug 2022 (elapsed: 00:00:02)
In [10]:
cat(system(
"
head ANNOT.genes.annot
", 
intern=TRUE), sep='\n')
# window_up = 0
# window_down = 0
148398	1:859993:879961	rs28576697	rs7523549
26155	1:879583:894679	rs3748592	rs2340582	rs4246503	rs3748597	rs17160698	rs28415373	rs3748594	rs3748593
375790	1:955503:991499	rs2710888	rs13303287	rs3128097	rs17160776	rs6657048	rs9778201
54991	1:1017198:1051736	rs9442398	rs3737728	rs10907177	rs3766191	rs11260595	rs9442372	rs4970405
254173	1:1109286:1133315	rs10907175	rs6668667	rs12092254
8784	1:1138888:1142163	rs11466691	rs11466681
7293	1:1146706:1149703	rs34945898
51150	1:1152288:1167447	rs6603781	rs7515488	rs3766186	rs3813199	rs6671424	rs11721	rs12036216	rs11260562
  • MAGMA performs annotation, mapping SNPs to genes based on the transcription region of each gene. A SNP is mapped to a gene if it is located either inside the transcription region of the gene.
  • The --snp-loc flag specifies which file to use to read the SNP locations from, and the --gene-loc flag specifies the file that defines the gene locations. The latter contains one row per gene, with the values: gene ID, chromosome, transcription start and stop site (in base-pair position), genomic strand (this relates to the direction in which the gene is transcribed: coded as + for the positive/sense strand and - for the negative/antisense strand), and official gene symbol.
  • Running this command will create the file ANNOT.genes.annot, containing the mapping of SNPs to genes. This will be used as an input file for the gene analysis. Each row in the file corresponds to a gene, containing: the gene ID, the mapping region (chromosome:start:stop), and then the list of SNP IDs mapped to that gene. A ANNOT.log file will also be created, containing the output that was also printed to the screen. It provides you with useful information about the steps that were performed (such as the number of values read from input files or printed to output files), as well as any warnings and errors that occurred during execution.

2. Gene analysis

3 models available in MAGMA

- SNP-wise models: compute SNP associations with phenotype first
    - SNP-wise Mean: performs test on mean SNP association
    - SNP-wise Top: performs test on strongest SNP association
    - SNP-wise Multi: combines SNP-wise Top and Mean
In [14]:
#### Step 2: gene analysis ####
#In this step we will run a gene analysis, which will perform a test of association for each gene and 
#create the input file needed for all subsequent gene-set analyses. We will do so using the command:
In [15]:
getwd()
'/scratch/silo1/BGA_LAB/workshop/analysis'
In [16]:
fwrite(assoc[,c("SNP","P")],"sumStat_rs.txt",sep = "\t",col.names = TRUE)
In [17]:
# Here, g1000_eur and snp_gene.genes.annot denote the reference data file for European ancestry population 
# and the gene location file, respsectively. 

cat(system(
"/home/rasyed2/Tool/magma --bfile /scratch/silo1/BGA_LAB/workshop/dataset/Reference/EUR/g1000_eur \\
--pval sumStat_rs.txt N=379 \\
--gene-annot ANNOT.genes.annot \\
--out GENE_SNP_PValue", 
intern=TRUE), sep='\n')
Welcome to MAGMA v1.10 (linux)
Using flags:
	--bfile /scratch/silo1/BGA_LAB/workshop/dataset/Reference/EUR/g1000_eur
	--pval sumStat_rs.txt N=379
	--gene-annot ANNOT.genes.annot
	--out GENE_SNP_PValue

Start time is 13:27:50, Monday 01 Aug 2022

Loading PLINK-format data...
Reading file /scratch/silo1/BGA_LAB/workshop/dataset/Reference/EUR/g1000_eur.fam... 503 individuals read
Reading file /scratch/silo1/BGA_LAB/workshop/dataset/Reference/EUR/g1000_eur.bim... 22665064 SNPs read
Preparing file /scratch/silo1/BGA_LAB/workshop/dataset/Reference/EUR/g1000_eur.bed... 

Reading SNP synonyms from file /scratch/silo1/BGA_LAB/workshop/dataset/Reference/EUR/g1000_eur.synonyms (auto-detected)
	read 6016767 mapped synonyms from file, mapping to 3921040 SNPs in the data
	WARNING: detected 133 synonymous SNP pairs in the data
	         skipped all synonym entries involved, synonymous SNPs are kept in analysis
	         writing list of detected synonyms in data to supplementary log file
Reading SNP p-values from file sumStat_rs.txt... 
	detected 2 variables in file
	using variable: SNP (SNP id)
	using variable: P (p-value)
	read 851066 lines from file, containing valid SNP p-values for 843717 SNPs in data (99.14% of lines, 3.723% of SNPs in data)
Loading gene annotation from file ANNOT.genes.annot... 
	16324 gene definitions read from file
	found 16307 genes containing valid SNPs in genotype data


Starting gene analysis... 
	using model: SNPwise-mean
	writing gene analysis results to file GENE_SNP_PValue.genes.out
	writing intermediate output to file GENE_SNP_PValue.genes.raw


End time is 13:29:07, Monday 01 Aug 2022 (elapsed: 00:01:17)
In [18]:
Method_2_GENE_SNP_PValue <- fread("GENE_SNP_PValue.genes.out")
Method_2_GENE_SNP_PValue <- Method_2_GENE_SNP_PValue[order(Method_2_GENE_SNP_PValue$P),]
head(Method_2_GENE_SNP_PValue); dim(Method_2_GENE_SNP_PValue) # 14374
A data.table: 6 × 9
GENECHRSTARTSTOPNSNPSNPARAMNZSTATP
<int><int><int><int><int><int><int><dbl><dbl>
286183 86316150163912211266363793.75288.7448e-05
162963195283949852870376 21 63793.61471.5036e-04
55316174855619048563336 4 23793.53942.0055e-04
7439116171735661731935 2 13793.44262.8809e-04
219970115860153858611997 3 13793.42803.0406e-04
14676017 1837971 1928178 35143793.39643.4141e-04
  1. 16307
  2. 9
In [41]:
#  FDR (false discovery rate)
p_adjust_fdr <- p.adjust(Method_2_GENE_SNP_PValue$P, method = "fdr", n = length(Method_2_GENE_SNP_PValue$P))
Method_2_GENE_SNP_PValue_fdr <- cbind(Method_2_GENE_SNP_PValue,p_adjust_fdr)
Method_2_GENE_SNP_PValue_fdr <- as.data.frame(Method_2_GENE_SNP_PValue_fdr)
head(Method_2_GENE_SNP_PValue_fdr)

#dim(Method_2_GENE_SNP_PValue_fdr[Method_2_GENE_SNP_PValue_fdr$p_adjust_fdr < 0.1,]) # 0
A data.frame: 6 × 10
GENECHRSTARTSTOPNSNPSNPARAMNZSTATPp_adjust_fdr
<int><int><int><int><int><int><int><dbl><dbl><dbl>
1286183 86316150163912211266363793.75288.7448e-050.6452136
2162963195283949852870376 21 63793.61471.5036e-040.6452136
3 55316174855619048563336 4 23793.53942.0055e-040.6452136
4 7439116171735661731935 2 13793.44262.8809e-040.6452136
5219970115860153858611997 3 13793.42803.0406e-040.6452136
614676017 1837971 1928178 35143793.39643.4141e-040.6452136
In [42]:
range(Method_2_GENE_SNP_PValue_fdr$p_adjust_fdr)
  1. 0.645213633333333
  2. 1
In [43]:
mean(Method_2_GENE_SNP_PValue_fdr$p_adjust_fdr)
0.989297955183677
In [1]:
help(p.adjust)
In [20]:
NCBI_37 <- fread("/projects/bga_lab/DATA_REPOSITORIES/IMP_PIPELINE_FILES/Magma/NCBI37.3.gene.loc")
NCBI_37$gene <- tolower(NCBI_37$V6)
NCBI_37 <- NCBI_37[,c(1,7)]
Method_2_GENE_SNP_PValue_fdr <- merge(Method_2_GENE_SNP_PValue_fdr,NCBI_37, by.x = "GENE", by.y = "V1", sort = F)
head(Method_2_GENE_SNP_PValue_fdr); dim(Method_2_GENE_SNP_PValue_fdr)
A data.frame: 6 × 11
GENECHRSTARTSTOPNSNPSNPARAMNZSTATPp_adjust_fdrgene
<int><int><int><int><int><int><int><dbl><dbl><dbl><chr>
1286183 86316150163912211266363793.75288.7448e-050.6452136nkain3
2162963195283949852870376 21 63793.61471.5036e-040.6452136znf610
3 55316174855619048563336 4 23793.53942.0055e-040.6452136rsad1
4 7439116171735661731935 2 13793.44262.8809e-040.6452136best1
5219970115860153858611997 3 13793.42803.0406e-040.6452136glyatl2
614676017 1837971 1928178 35143793.39643.4141e-040.6452136rtn4rl1
  1. 16307
  2. 11

H-Magma

In [ ]:
# For H-MAGMA, we used the identical setting except using H-MAGMA inputs guided by Hi-C interaction profiles.
#CRITICAL: If performing H-MAGMA analysis instead of standard MAGMA analysis, it is crucial 
#to use the H-MAGMA provided Annot_File corresponding to your cell/tissue of interest instead of the one generated in step 2

# Link: https://github.com/thewonlab/H-MAGMA/tree/master/Input_Files
In [25]:
cat(system(
"/home/rasyed2/Tool/magma --bfile /scratch/silo1/BGA_LAB/workshop/dataset/Reference/EUR/g1000_eur \\
--pval sumStat_rs.txt N=379 \\
--gene-annot /scratch/silo1/BGA_LAB/workshop/MAGMAdefault.genes.annot \\
--out H_GENE_SNP_PValue", 
intern=TRUE), sep='\n')
Welcome to MAGMA v1.10 (linux)
Using flags:
	--bfile /scratch/silo1/BGA_LAB/workshop/dataset/Reference/EUR/g1000_eur
	--pval sumStat_rs.txt N=379
	--gene-annot /scratch/silo1/BGA_LAB/workshop/MAGMAdefault.genes.annot
	--out H_GENE_SNP_PValue

Start time is 13:48:27, Monday 01 Aug 2022

Loading PLINK-format data...
Reading file /scratch/silo1/BGA_LAB/workshop/dataset/Reference/EUR/g1000_eur.fam... 503 individuals read
Reading file /scratch/silo1/BGA_LAB/workshop/dataset/Reference/EUR/g1000_eur.bim... 22665064 SNPs read
Preparing file /scratch/silo1/BGA_LAB/workshop/dataset/Reference/EUR/g1000_eur.bed... 

Reading SNP synonyms from file /scratch/silo1/BGA_LAB/workshop/dataset/Reference/EUR/g1000_eur.synonyms (auto-detected)
	read 6016767 mapped synonyms from file, mapping to 3921040 SNPs in the data
	WARNING: detected 133 synonymous SNP pairs in the data
	         skipped all synonym entries involved, synonymous SNPs are kept in analysis
	         writing list of detected synonyms in data to supplementary log file
Reading SNP p-values from file sumStat_rs.txt... 
	detected 2 variables in file
	using variable: SNP (SNP id)
	using variable: P (p-value)
	read 851066 lines from file, containing valid SNP p-values for 843717 SNPs in data (99.14% of lines, 3.723% of SNPs in data)
Loading gene annotation from file /scratch/silo1/BGA_LAB/workshop/MAGMAdefault.genes.annot... 
	56737 gene definitions read from file
	found 53816 genes containing valid SNPs in genotype data


Starting gene analysis... 
	using model: SNPwise-mean
	writing gene analysis results to file H_GENE_SNP_PValue.genes.out
	writing intermediate output to file H_GENE_SNP_PValue.genes.raw


End time is 13:55:06, Monday 01 Aug 2022 (elapsed: 00:06:39)
In [26]:
H_GENE_SNP_PValue <- fread("H_GENE_SNP_PValue.genes.out")
H_GENE_SNP_PValue <- H_GENE_SNP_PValue[order(H_GENE_SNP_PValue$P),]
head(H_GENE_SNP_PValue); dim(H_GENE_SNP_PValue) # 14374
A data.table: 6 × 9
GENECHRSTARTSTOPNSNPSNPARAMNZSTATP
<chr><int><int><int><int><int><int><dbl><dbl>
ENSG00000090402 3164686686164831283 13 43793.85965.6778e-05
ENSG00000185942 8 63126501 63922211266363793.81806.7258e-05
ENSG00000253975 8 63384757 63431370 25 63793.73589.3544e-05
ENSG0000027589717 48518180 48565752 5 23793.68081.1624e-04
ENSG0000013644417 48521161 48573336 5 23793.68081.1624e-04
ENSG0000013645717 48531859 48581327 6 23793.60491.5613e-04
  1. 53816
  2. 9
In [27]:
#  FDR
p_adjust_fdr <- p.adjust(H_GENE_SNP_PValue$P, method = "fdr", n = length(H_GENE_SNP_PValue$P))
H_GENE_SNP_PValue_fdr <- cbind(H_GENE_SNP_PValue,p_adjust_fdr)
H_GENE_SNP_PValue_fdr <- as.data.frame(H_GENE_SNP_PValue_fdr)
#dim(Method_3_H_GENE_SNP_PValue_fdr[Method_3_H_GENE_SNP_PValue_fdr$p_adjust_fdr < 0.1,]) # 1 12

#Method_3_H_GENE_SNP_PValue_fdr <- Method_3_H_GENE_SNP_PValue_fdr[Method_3_H_GENE_SNP_PValue_fdr$p_adjust_fdr < 0.1,]
head(H_GENE_SNP_PValue_fdr)
A data.frame: 6 × 10
GENECHRSTARTSTOPNSNPSNPARAMNZSTATPp_adjust_fdr
<chr><int><int><int><int><int><int><dbl><dbl><dbl>
1ENSG00000090402 3164686686164831283 13 43793.85965.6778e-050.8750896
2ENSG00000185942 8 63126501 63922211266363793.81806.7258e-050.8750896
3ENSG00000253975 8 63384757 63431370 25 63793.73589.3544e-050.8750896
4ENSG0000027589717 48518180 48565752 5 23793.68081.1624e-040.8750896
5ENSG0000013644417 48521161 48573336 5 23793.68081.1624e-040.8750896
6ENSG0000013645717 48531859 48581327 6 23793.60491.5613e-040.8750896

FUMA

FUMA is a web source that is used to petform functional annotation of GWAS results + gene prioritization + interactive visualization. This is crucial because the results from GWAS will not pinpoint the causal variants directly. The input is GWAS summary statistics. Link: (https://fuma.ctglab.nl/)

  • “The SNP2GENE function takes GWAS summary statistics as an input, and provides extensive functional annotation for all SNPs in genomic areas identified by lead SNPs”
  • How do I get started with FUMA?

    • Register with your email address
    • Summary statistics ready with rsIDs
  • What kind of information can I get?

    • GWAS-based and gene-based Manhattan and QQ plots
    • Gene set analysis
    • Tissue expression analysis
In [45]:
sum_stat <- fread("BMI_gwas_plink.assoc.linear")
sum_stat <- sum_stat[order(sum_stat$P),]
head(sum_stat); dim(sum_stat)
A data.table: 6 × 9
CHRSNPBPA1TESTNMISSBETASTATP
<int><chr><int><chr><chr><int><dbl><dbl><dbl>
4rs2412493 54638699GADD379-1.672-5.1135.120e-07
4rs2590767 54618574TADD379-1.633-4.9601.081e-06
3rs6775393188504998TADD379 1.074 4.8781.597e-06
6rs4333386169415956CADD379-1.189-4.7253.280e-06
3rs6778476188544680GADD379 1.009 4.6364.934e-06
6rs655307 169427629AADD379-1.231-4.6105.562e-06
  1. 851065
  2. 9
In [ ]:
fwrite(sum_stat[,c("SNP","P")],"sumStat.txt",sep = "\t",col.names = TRUE)
# upload to the fuma website
In [46]:
dim(sum_stat[sum_stat$P < 5e-6,])
  1. 5
  2. 9

3. SNP-based heritability, Genetic correlations and MLMA model

Heritability

Heritability (h2) quantifies the degree by which genetic factors contribute to a trait.

$$h2 = \frac{{V}_{G}}{{V}_{P}} \:\:\:\:\:\: where \:\: {V}_{P} = {V}_{G} + {V}_{E} $$


Variance as a measure of individual difference in a phenotype (Vp) can emerge due to genetic difference in a population (Vg) and environmental variance Ve,

h2 = 1, all variation in a population is due to genetic variations (i.e., there is no environmentally caused variation).
h2 = 0, all variation in the population comes from differences in the environments

Running a heritability in GCTA

Genome-wide Complex Trait Analysis (GCTA) software is used to calculate a relatedness matrix and perform additional analysis. (Developed by Jian Yang and colleagues from the University of Queensland in Australia)

Basic applications of GCTA:
We can estimate the relatedness of individuals and SNP-heritability by calculating a Genomic Relationship Matrix (GRM) using the command –make-grm

Genetic Relation Matrix (GRM)

  • A GRM is a matrix (of dimension say n,m), which contains coefficients of genetic relationships.
  • IBD sharing (e.g., 0.5 for fullsibs)
In [32]:
# calculating the genetic relationship matrix (GRM) from all the autosomal SNPs
cat(system(
"gcta193 --bfile /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc \\
--autosome --thread-num 100 \\
--make-grm-bin \\
--out BMI", 
intern=TRUE), sep='\n')
*******************************************************************
* Genome-wide Complex Trait Analysis (GCTA)
* version 1.93.2 beta Linux
* (C) 2010-present, Jian Yang, The University of Queensland
* Please report bugs to Jian Yang <jian.yang.qt@gmail.com>
*******************************************************************
Analysis started at 14:04:22 EDT on Mon Aug 01 2022.
Hostname: dataminer

Accepted options:
--bfile /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc
--autosome
--thread-num 100
--make-grm-bin
--out BMI

Note: the program will be running on 100 threads.

Reading PLINK FAM file from [/scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.fam].
379 individuals to be included from [/scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.fam].
Reading PLINK BIM file from [/scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bim].
851065 SNPs to be included from [/scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bim].
851065 SNPs from chromosome 1 to chromosome 22 are included in the analysis.
Reading PLINK BED file from [/scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bed] in SNP-major format ...
Genotype data for 379 individuals and 851065 SNPs to be included from [/scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bed].
Calculating allele frequencies ...
Recoding genotypes (individual major mode) ...

Calculating the genetic relationship matrix (GRM) ... (Note: default speed-optimized mode, may use huge RAM)

Summary of the GRM:
Mean of diagonals = 0.999611
Variance of diagonals = 0.000288368
Mean of off-diagonals = -0.00264448
Variance of off-diagonals = 6.05971e-05
GRM of 379 individuals has been saved in the file [BMI.grm.bin] (in binary format).
Number of SNPs to calcuate the genetic relationship between each pair of individuals has been saved in the file [BMI.grm.N.bin] (in binary format).
IDs for the GRM file [BMI.grm.bin] have been saved in the file [BMI.grm.id].

Analysis finished at 14:04:57 EDT on Mon Aug 01 2022
Overall computational time: 34.92 sec.
In [33]:
# Heritability

cat(system(
"gcta193 --reml --grm BMI \\
--qcovar pcs_EU_QC.txt \\
--pheno /scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt \\
--thread-num 100 \\
--out BMI_h2", 
intern=TRUE), sep='\n')
*******************************************************************
* Genome-wide Complex Trait Analysis (GCTA)
* version 1.93.2 beta Linux
* (C) 2010-present, Jian Yang, The University of Queensland
* Please report bugs to Jian Yang <jian.yang.qt@gmail.com>
*******************************************************************
Analysis started at 14:04:57 EDT on Mon Aug 01 2022.
Hostname: dataminer

Accepted options:
--reml
--grm BMI
--qcovar pcs_EU_QC.txt
--pheno /scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt
--thread-num 100
--out BMI_h2

Note: the program will be running on 100 threads.

Reading IDs of the GRM from [BMI.grm.id].
379 IDs read from [BMI.grm.id].
Reading the GRM from [BMI.grm.bin].
GRM for 379 individuals are included from [BMI.grm.bin].
Reading phenotypes from [/scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt].
Non-missing phenotypes of 1093 individuals are included from [/scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt].
Reading quantitative covariates from [pcs_EU_QC.txt].
20 quantitative covariate(s) of 380 individuals read from [pcs_EU_QC.txt].

20 quantitative variable(s) included as covariate(s).
379 individuals are in common in these files.

Performing  REML analysis ... (Note: may take hours depending on sample size).
379 observations, 21 fixed effect(s), and 2 variance component(s)(including residual variance).
Calculating prior values of variance components by EM-REML ...
Updated prior values: 4.40028 4.39997
logL: -574.862
Running AI-REML algorithm ...
Iter.	logL	V(G)	V(e)	
1	-574.81	4.66789	4.17832	
2	-574.79	5.24982	3.69724	
3	-574.78	5.23997	3.70807	
4	-574.78	5.24038	3.70768	
Log-likelihood ratio converged.

Calculating the logLikelihood for the reduced model ...
(variance component 1 is dropped from the model)
Calculating prior values of variance components by EM-REML ...
Updated prior values: 8.86019
logL: -575.05495
Running AI-REML algorithm ...
Iter.	logL	V(e)	
1	-575.01	8.86352	
2	-575.01	8.87072	
3	-575.01	8.87072	
Log-likelihood ratio converged.

Summary result of REML analysis:
Source	Variance	SE
V(G)	5.240379	7.713765
V(e)	3.707676	7.562342
Vp	8.948055	0.684677
V(G)/Vp	0.585644	0.851357

Sampling variance/covariance of the estimates of variance components:
5.950218e+01	-5.811121e+01	
-5.811121e+01	5.718902e+01	

Summary result of REML analysis has been saved in the file [BMI_h2.hsq].

Analysis finished at 14:04:57 EDT on Mon Aug 01 2022
Overall computational time: 0.32 sec.
In [36]:
cat(system(
"cat BMI_h2.hsq", 
intern=TRUE), sep='\n')
Source	Variance	SE
V(G)	5.240379	7.713765
V(e)	3.707676	7.562342
Vp	8.948055	0.684677
V(G)/Vp	0.585644	0.851357
logL	-574.782
logL0	-575.013
LRT	0.462
df	1
Pval	2.4833e-01
n	379

GCTA decompose the variance of BMI into parts:

  1. V(G) which is the variance explained by additive genetics
  2. V(e) which is the part due to the environment (the non genetic part).

The estimate of heritability (h2snp) is the proportion of V(G) over the total variance of the phenotype (vp). The result is 0.58. We therefore estimate that almost 58% of the variance of this phenotype (BMI) is attributable to genetics

Degree of uncertainty:
An indication of the precision of the estimation of heritability of h2snp is given by the standard error of the estimation (SE). In this case the SE is 0.85, which is very high. This means that the estimate is imprecise. GCTA performs a statistical test called a LOG-likelihood ratio test that indicates how the h2snp estimate is different from zero . In this case given the value of 0.25, we fail to reject the null hypothesis. We cannot conclude that there is no evidence of a genetic component of BMI. We need a larger sample size to detect it using GCTA.

Estimation heritability from summary statistics

  • Heritability of polygenic traits can be estimated from summary statistics.
  • In polygenic traits, SNPs that are highly correlated with many other SNPs (i.e have a LD score) are more likely to “tag” a causal SNP.
  • In a GWAS, these SNPs with a high LD score are expected to have higher association statistics than SNPs with a lower LD score.
  • We can estimate SNP-heritability by running a linear regression of all the association test statistics from a GWAS on the LD scores of each SNP, called LD score regression (LDSC).
  • In contrast with the genomic-relatedness-based restricted maximum-likelihood (GREML) method, LDSC does not require individual genetic data.
In [ ]:
cat(system(
"/projects/bga_lab/Rameez/Tool/miniconda3/envs/ldsc/bin/python \\
/projects/bga_lab/Rameez/Tool/ldsc/ldsc.py \\
--help", 
intern=TRUE), sep='\n')

We require three sources of data

  • summary statistics from a GWAS
  • a list of hapmap SNPs
  • LD scores for each SNP.

LD scores can be calculated using LDSC, but if using GWAS results from European or Asian populations, we can also download from HapMap3 SNPs from the LDSC website (https://github.com/bulik/ldsc)

The code provides a guide on how to estimate the SNP-heritability for adult height on individuals with European ancestry, using 1000G reference sample in LD Score regression.

This is a command in LDSC that is designed to prepare the summary statistics, which is called munge_sum-stats.py

In [29]:
cat(system(
"zcat /scratch/silo1/BGA_LAB/workshop/dataset/Giant_Height2018.txt.gz | head", 
intern=TRUE), sep='\n')
CHR	POS	SNP	Tested_Allele	Other_Allele	Freq_Tested_Allele_in_HRS	BETA	SE	P	N
7	92383888	rs10	A	C	0.06431	-0.0066	0.0035	 6.0e-02	605309
12	126890980	rs1000000	A	G	0.2219	-0.0001	0.0017	 9.5e-01	706961
4	21618674	rs10000010	T	C	0.5086	-0.0022	0.0014	 1.2e-01	697417
4	1357325	rs10000012	C	G	0.8634	 0.0143	0.0021	 4.5e-12	709514
4	37225069	rs10000013	A	C	0.7708	 0.0015	0.0017	 3.9e-01	704912
4	84778125	rs10000017	T	C	0.2284	 0.0019	0.0017	 2.8e-01	703109
3	183635768	rs1000002	T	C	0.4884	 0.0040	0.0014	 5.1e-03	709522
4	95733906	rs10000023	T	G	0.5817	 0.0029	0.0015	 4.6e-02	691905
4	156176217	rs10000027	C	G	0.771	-0.0013	0.0019	 4.8e-01	526087
In [ ]:
# assoc_bim <- merge(assoc,data_bim,by.x = "SNP",by.y = "V2")
# assoc_bim <- assoc_bim[,c(1:4,14,5:9)]
# colnames(assoc_bim)[5] <- "A2"
# head(assoc_bim)
In [30]:
cat(system(
"/projects/bga_lab/Rameez/Tool/miniconda3/envs/ldsc/bin/python \\
/projects/bga_lab/Rameez/Tool/ldsc/munge_sumstats.py \\
--sumstats /scratch/silo1/BGA_LAB/workshop/dataset/Giant_Height2018.txt.gz \\
--snp SNP --p P --a1 Tested_Allele --a2 Other_Allele --N-col N \\
--ignore SE,CHR,POS,Freq_Tested_Allele_in_HRS \\
--out height
", 
intern=TRUE), sep='\n')
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.1
* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call: 
./munge_sumstats.py \
--out height \
--N-col N \
--a1 Tested_Allele \
--a2 Other_Allele \
--snp SNP \
--sumstats /scratch/silo1/BGA_LAB/workshop/dataset/Giant_Height2018.txt.gz \
--ignore SE,CHR,POS,Freq_Tested_Allele_in_HRS \
--p P 

Interpreting column names as follows:
Other_Allele:	Allele 2, interpreted as non-ref allele for signed sumstat.
N:	Sample size
P:	p-Value
BETA:	[linear/logistic] regression coefficient (0 --> no effect; above 0 --> A1 is trait/risk increasing)
Tested_Allele:	Allele 1, interpreted as ref allele for signed sumstat.
SNP:	Variant ID (e.g., rs number)

Reading sumstats from /scratch/silo1/BGA_LAB/workshop/dataset/Giant_Height2018.txt.gz into memory 5000000 SNPs at a time.
.WARNING: 104 SNPs had P outside of (0,1]. The P column may be mislabeled.
 done
Read 2334001 SNPs from --sumstats file.
Removed 0 SNPs with missing values.
Removed 0 SNPs with INFO <= 0.9.
Removed 0 SNPs with MAF <= 0.01.
Removed 104 SNPs with out-of-bounds p-values.
Removed 358003 variants that were not SNPs or were strand-ambiguous.
1975894 SNPs remain.
Removed 0 SNPs with duplicated rs numbers (1975894 SNPs remain).
Removed 0 SNPs with N < 472801.333333 (1975894 SNPs remain).
Median value of BETA was 0.0, which seems sensible.
Writing summary statistics for 1975894 SNPs (1975894 with nonmissing beta) to height.sumstats.gz.

Metadata:
Mean chi^2 = 8.626
Lambda GC = 3.61
Max chi^2 = 1424.989
113870 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Mon Aug  1 14:04:07 2022
Total time elapsed: 41.32s

The output explains:

  1. how many SNPs have been read from LDSC
  2. how many have been filtered out due to low allele frequencies or incorrect information in other columns.

Next step involves the calculation of LD score regression and the estimation of SNP-heritability. This can be done using the command ldsc.py

In [ ]:
cat(system(
"/projects/bga_lab/Rameez/Tool/miniconda3/envs/ldsc/bin/python \\
/projects/bga_lab/Rameez/Tool/ldsc/ldsc.py \\
--h2 /scratch/silo1/BGA_LAB/workshop/analysis/height.sumstats.gz \\
--ref-ld-chr /scratch/silo1/BGA_LAB/workshop/dataset/ld_scores/eur_w_ld_chr/ \\
--w-ld-chr /scratch/silo1/BGA_LAB/workshop/dataset/ld_scores/eur_w_ld_chr/ \\
--out height_h2
", 
intern=TRUE), sep='\n')

The results are stored in the newly created file height_h2.log. The log file reports:

  1. how many SNPs were provided by the summary statistics
  2. how they merged with the LD scoes.
In [49]:
cat(system(
"
cat height_h2.log
", 
intern=TRUE), sep='\n')
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.1
* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call: 
./ldsc.py \
--h2 /scratch/silo1/BGA_LAB/workshop/analysis/height.sumstats.gz \
--ref-ld-chr /scratch/silo1/BGA_LAB/workshop/dataset/ld_scores/eur_w_ld_chr/ \
--out height_h2 \
--w-ld-chr /scratch/silo1/BGA_LAB/workshop/dataset/ld_scores/eur_w_ld_chr/ 

Beginning analysis at Mon Aug  1 14:04:08 2022
Reading summary statistics from /scratch/silo1/BGA_LAB/workshop/analysis/height.sumstats.gz ...
Read summary statistics for 1975894 SNPs.
Reading reference panel LD Score from /scratch/silo1/BGA_LAB/workshop/dataset/ld_scores/eur_w_ld_chr/[1-22] ... (ldscore_fromlist)
Read reference panel LD Scores for 1290028 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from /scratch/silo1/BGA_LAB/workshop/dataset/ld_scores/eur_w_ld_chr/[1-22] ... (ldscore_fromlist)
Read regression weight LD Scores for 1290028 SNPs.
After merging with reference panel LD, 1013790 SNPs remain.
After merging with regression SNP LD, 1013790 SNPs remain.
Using two-step estimator with cutoff at 30.
Total Observed scale h2: 0.4552 (0.0193)
Lambda GC: 3.6129
Mean Chi^2: 8.8734
Intercept: 1.8969 (0.0452)
Ratio: 0.1139 (0.0057)
Analysis finished at Mon Aug  1 14:04:22 2022
Total time elapsed: 13.47s

λ > 1.1 or intercept not close to 1 -> indicates inflation such as population stratification, for example due to the mismatch between LD reference panel and GWAS samples

λ = 1 -> perfect match between LD reference panel and GWAS samples

If the LD reference sample derived from African population and sum stat is EUR, we should expect λ > 1.1 due to the population stratification.

Running a GWAS in mixed linear model based association analysis (MLMA)

In [34]:
cat(system(
"gcta193 --mlma --bfile /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc --grm BMI \\
--qcovar pcs_EU_QC.txt \\
--pheno /scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt \\
--thread-num 100 \\
--out BMI", 
intern=TRUE), sep='\n')
*******************************************************************
* Genome-wide Complex Trait Analysis (GCTA)
* version 1.93.2 beta Linux
* (C) 2010-present, Jian Yang, The University of Queensland
* Please report bugs to Jian Yang <jian.yang.qt@gmail.com>
*******************************************************************
Analysis started at 14:04:57 EDT on Mon Aug 01 2022.
Hostname: dataminer

Accepted options:
--mlma 
--bfile /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc
--grm BMI
--qcovar pcs_EU_QC.txt
--pheno /scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt
--thread-num 100
--out BMI

Note: the program will be running on 100 threads.

Reading PLINK FAM file from [/scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.fam].
379 individuals to be included from [/scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.fam].
Reading PLINK BIM file from [/scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bim].
851065 SNPs to be included from [/scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bim].
Reading PLINK BED file from [/scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bed] in SNP-major format ...
Genotype data for 379 individuals and 851065 SNPs to be included from [/scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bed].
Reading phenotypes from [/scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt].
Non-missing phenotypes of 1093 individuals are included from [/scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt].
Reading quantitative covariates from [pcs_EU_QC.txt].
20 quantitative covariate(s) of 380 individuals read from [pcs_EU_QC.txt].
Reading IDs of the GRM from [BMI.grm.id].
379 IDs read from [BMI.grm.id].
Reading the GRM from [BMI.grm.bin].
GRM for 379 individuals are included from [BMI.grm.bin].
379 individuals are in common in these files.
20 quantitative variable(s) included as covariate(s).

Performing MLM association analyses (including the candidate SNP) ...

Performing  REML analysis ... (Note: may take hours depending on sample size).
379 observations, 21 fixed effect(s), and 2 variance component(s)(including residual variance).
Calculating prior values of variance components by EM-REML ...
Updated prior values: 4.40028 4.39997
logL: -574.862
Running AI-REML algorithm ...
Iter.	logL	V(G)	V(e)	
1	-574.81	4.66789	4.17832	
2	-574.79	5.24982	3.69724	
3	-574.78	5.23997	3.70807	
4	-574.78	5.24038	3.70768	
Log-likelihood ratio converged.
Calculating allele frequencies ...

Running association tests for 851065 SNPs ...

Saving the results of the mixed linear model association analyses of 851065 SNPs to [BMI.mlma] ...

Analysis finished at 14:05:28 EDT on Mon Aug 01 2022
Overall computational time: 30.26 sec.
In [57]:
mlma <- fread("BMI.mlma")
mlma <- mlma[order(mlma$p),]
head(mlma)
A data.table: 6 × 9
ChrSNPbpA1A2Freqbsep
<int><chr><int><chr><chr><dbl><dbl><dbl><dbl>
4rs2412493 54638699GA0.121372-1.6189400.3315391.04428e-06
4rs2590767 54618574TC0.120053-1.5811000.3327722.02107e-06
3rs6775393 188504998TC0.412929 1.0645900.2264222.57877e-06
16rs11649266 52800370AG0.395778-1.0058400.2237266.92957e-06
3rs6778476 188544680GA0.428760 1.0038800.2239417.36726e-06
16rs1077395 52767889GA0.398417-0.9945850.2227167.98056e-06
In [58]:
assoc <- fread("BMI_gwas_plink.assoc.linear")
assoc <- assoc[order(assoc$P),]
head(assoc)
A data.table: 6 × 9
CHRSNPBPA1TESTNMISSBETASTATP
<int><chr><int><chr><chr><int><dbl><dbl><dbl>
4rs2412493 54638699GADD379-1.672-5.1135.120e-07
4rs2590767 54618574TADD379-1.633-4.9601.081e-06
3rs6775393188504998TADD379 1.074 4.8781.597e-06
6rs4333386169415956CADD379-1.189-4.7253.280e-06
3rs6778476188544680GADD379 1.009 4.6364.934e-06
6rs655307 169427629AADD379-1.231-4.6105.562e-06

5. PGS analysis

The idea of a polygenic score is to combine all the risk due to thousands of variants with small effects into a single “gene score”

A polygenic risk score tells you how a person's risk compares to others with a different genetic constitution

Genetic risk for an individual’s disease calculated using the

$${PRS}_{i} = \sum_{j=1}^{m}{\beta}_{j}{G}_{ij} \:\: ({\beta : effect \:size, \: G: genotype, \: i: individual,\: j:trait-association SNP}) $$


In [184]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/score_PRS.PNG")
In [147]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/prs.PNG")
In [148]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/prs_table.PNG")

For PRS calculation, the following analysis process is followed.

  1. GWAS results (summary statistics) and reference panels (1000G) performed in an independent study population using linkage disequilibrium (LD) to build a model
  2. Analysis of association by applying the build model to target genome information.

Target dataset

Genetic QC

Prior to running this script’s quality control (QC) and analysis, QC was performed on these samples.

Pre-Imputation
Using Plink v1.9:

  1. SNP missingness: remove markers which are missing for more than 5% of individuals
  2. Minor allele frequency (MAF): remove markers with a MAF < 1%
  3. SNP missingness: remove individuals where more than 5% of markers are missing
  4. Relatedness – duplicates: remove one individual from any pair if they are known to be duplicates.
  5. Relatedness – relatives: because some individuals in genotype dataset were related due to the study design
  6. Derive principle components (accounting for population stratification)
  7. Repeat QC steps 1-6 on each batch separately
  8. Upload each batch of genotypes to the Michigan Imputation Server and impute missing genotypes using the Haplotype Reference Consortium reference panel (Eagle v2.4 for phasing, Minimac4).

Post-Imputation QC

  1. Convert dosage data to best guess genotype data
  2. Minor allele frequency (MAF): remove markers with a MAF < 1%
  3. SNP missingness: remove markers which are missing for more than 5% of individuals
  4. Hardy-Weinberg Equilibrium: remove markers with large deviations from the equilibrium (p ≤ 10−4)
  5. INFO score/R^2: remove markers with an INFO score less than 0.7

Summary statistics

Prepare independent GWAS analysis results for PRS analysis:

  • Obtained summary statistics data through public websites (dbGap) and checked for inclusion of required information.
  • Chromosome, position, allele information (effect, other allele), (beta), standard error, sample size, p-value.
    • In meta-analysis , we have Z-score instead of beta/OR
      • • Plink –bfile 1K_afr –freq –out MAF_Freq
    • Derive Beta from Z-score and allele frequency
In [185]:
sumstat <- fread("/scratch/silo1/BGA_LAB/workshop/dataset/height.txt")
head(sumstat); dim(sumstat)
A data.table: 6 × 10
CHRPOSSNPTested_AlleleOther_AlleleFreq_Tested_Allele_in_HRSBETASEPN
<int><int><chr><chr><chr><dbl><dbl><dbl><dbl><dbl>
7 92383888rs10 AC0.06431-0.00660.00356.0e-02605309
12126890980rs1000000 AG0.22190-0.00010.00179.5e-01706961
4 21618674rs10000010TC0.50860-0.00220.00141.2e-01697417
4 1357325rs10000012CG0.86340 0.01430.00214.5e-12709514
4 37225069rs10000013AC0.77080 0.00150.00173.9e-01704912
4 84778125rs10000017TC0.22840 0.00190.00172.8e-01703109
  1. 2334001
  2. 10

Reference Panel

A reference panel is simply genomic data obtained from a representative sample of a larger population. For example, the UK Biobank cohort comprises ~500,000 individuals between 40 and 69 years old from the United Kingdom and can be used as a reference for the UK (or more roughly, European) population. Unfortunately, UK Biobank data is only available to researches upon application, so in this example we will use the fully public 1000 Genomes Project Phase III as a reference panel.

The 1000 Genomes Project Phase III comprises genomic data from 2504 individuals from multiple world populations, and although its sample size is relatively small compared to UK Biobank, it has the nice feature of being open for everyone to use it.

IMPORTANT NOTE: Input GWAS summary statistics and your reference panel (or LD matrices) must be in the same build

Clumping + P-value

  • If your target sample is relatively small (<500), it is best to use the data of the corresponding population in the Thousand Genomes Project when calculating the r2 of the LD
  • 1000G (503 EUR, 661 AFR)

LDpred2

  • Recommended 2000 samples for ld reference
  • 362 320 UKBB individuals used for ld reference
  • the target sample can be used as the LD reference panel if the sample size is large enough (e.g. N > 2000; Vilhjálmsson et al., 2015)
  • If you do not have enough data to use as LD reference (e.g. at least 2000 individuals)
    • use pre-computed LD matrices from the UK Biobank panel, made publicly available by LDpred2 authors
      • Note that these matrices are computed using a European population, so they won’t be the best fit for generating -PGS for a non-European population

PRS-cs

  • RS-CS which provides its own LD reference based on 503 individuals from the 1000 Genomes data; Ge et al. (2019) argued that the sample size of the LD reference has little impact on the performance of PRS-CS.
  • From the relatively limited experiments, PRS-CS doesn't seem to be sensitive to the reference, and the two panels gave similar prediction accuracy.
  • 1000 Genome
  • UK BioBank (350 000 EUR)

SBLUP

  • 1000 Genome (503 EUR)

Clumping Method

  • Using summary statistics, using p-value and LD information, how to select markers for PRS analysis
  • Prepare reference panel, download 1000 genome phase 3 data from the website
  • Prepare of target genome information for PRS calculation.
    • Pre/post imputation step required
  • Clumping: selects SNPs with the desired P-value threshold in each LD block
  • Analyze clumping by applying various thresholds referring
  • plink --bfile 1KGP3_AFR --clump sum_stat.txt --clump-p1 0.0001 --clump-r2 0.1 --clump-kb 500 --out CLUMP_OUT
    • --clump = (summary statistic)
    • --clump-p1 = index SNP p-value threshold
    • --clump-p2 = Secondary SNP p-value threshold
    • --clump-r2 = LD threshold
    • --clump-kb clumping section, distance in kb from index SNP to +- corresponding to the SNP selected
  • awk 'NR!=1{print $3}' CLUMP_OUT.clumped > AFR.indepdent.snp
  • PRS calculation using Plink
    • plink --bfile target --score sumstat --extract AFR.indepdent.snp --out AFR
In [167]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/ld_r.PNG")

LDpred2 (Version 2)

  • Improved accuracy of LDpred V1, enabling genome-wide analysis assays developed to Reference Paper Prive et al. LDpred2: better,faster strong
  • As a method of correcting the effect size by correcting the LD value, it adjusts the effect of genetic variation information
  • Preparation for LDpred v2
  • The prepared data is bigsnpr additionally installed through LDpred v2 installation. Convert plink file to rds file using packages R statistical
  • We can also use the target data to estimate LD if reference samples do not match well with the population composition of the base data

PRS-cs

  • Shrinkage factor to correct effect size to adjust the effect of all genetic variation information in summary statistics.
  • Ge et al. Polygenic prediction via Bayesian regression and continuous shrinkage priors, Nature Communications 2019
  • Download the LD reference panel of the race you want to analyze from the website below (1000 Genome project phase 3 Hapmap)
  • ID of SNP in reference must match target’s SNP
In [126]:
## 1KGP3_EAS # LD Reference Genotype File. Should be a (full path) filename prefix to a standard PLINK bed file (without .bed). Make sure that the fam and bim files with same names
cat(system(
"plink --bfile /projects/bga_lab/REPOSITORIES/PRS/ALLchr_1000G_EUR503_biallelic_rs_cleaned \\
--clump /scratch/silo1/BGA_LAB/workshop/dataset/height.txt \\
--clump-p1 0.0001  \\
--clump-r2 0.1 \\
--clump-kb 250 \\
--out CLUMP_OUT", 
intern=TRUE), sep='\n')
PLINK v1.90b6.18 64-bit (16 Jun 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to CLUMP_OUT.log.
Options in effect:
  --bfile /projects/bga_lab/REPOSITORIES/PRS/ALLchr_1000G_EUR503_biallelic_rs_cleaned
  --clump /scratch/silo1/BGA_LAB/workshop/dataset/height.txt
  --clump-kb 250
  --clump-p1 0.0001
  --clump-r2 0.1
  --out CLUMP_OUT

1031971 MB RAM detected; reserving 515985 MB for main workspace.
Allocated 163260 MB successfully, after larger attempt(s) failed.
7486182 variants loaded from .bim file.
503 people (0 males, 0 females, 503 ambiguous) loaded from .fam.
Ambiguous sex IDs written to CLUMP_OUT.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 503 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
7486182 variants and 503 people pass filters and QC.
Note: No phenotypes present.
--clump: 13332 clumps formed from 254024 top variants.
Results written to CLUMP_OUT.clumped .
In [127]:
cat(system(
"awk 'NR!=1{print $3}' CLUMP_OUT.clumped >  EUR.indepdent.snp", 
intern=TRUE), sep='\n')

In [128]:
cat(system(
"
plink \\
--bed /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bed \\
--bim /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bim \\
--fam /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.fam \\
--extract /scratch/silo1/BGA_LAB/workshop/analysis/EUR.indepdent.snp \\
--score /scratch/silo1/BGA_LAB/workshop/dataset/height.txt 3 4 7 header sum include-cnt double-dosage \\
--out target_score
", 
intern=TRUE), sep='\n')
PLINK v1.90b6.18 64-bit (16 Jun 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to target_score.log.
Options in effect:
  --bed /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bed
  --bim /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bim
  --extract /scratch/silo1/BGA_LAB/workshop/analysis/EUR.indepdent.snp
  --fam /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.fam
  --out target_score
  --score /scratch/silo1/BGA_LAB/workshop/dataset/height.txt 3 4 7 header sum include-cnt double-dosage

1031971 MB RAM detected; reserving 515985 MB for main workspace.
Allocated 163260 MB successfully, after larger attempt(s) failed.
851065 variants loaded from .bim file.
379 people (178 males, 201 females) loaded from .fam.
--extract: 5264 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 379 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
5264 variants and 379 people pass filters and QC.
Note: No phenotypes present.
--score: 5264 valid predictors loaded.
--score: Results written to target_score.profile .
In [163]:
# Linear regression 

score <- fread("target_score.profile")
score <- score[,c(2,6)]
head(score); dim(score)
pheno <- fread("/scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt")
pheno <- pheno[,c(2,3)]
colnames(pheno) <- c("IID","BMI")
head(pheno)
dim(pheno)
score_merge <- merge(pheno,score,by = "IID", sort = FALSE)
head(score_merge); dim(score_merge)
linear_mod <- lm(BMI ~ SCORESUM, data = score_merge)
summary(linear_mod)
A data.table: 6 × 2
IIDSCORESUM
<chr><dbl>
HG000960.6808
HG000970.5236
HG000991.8347
HG001001.3875
HG001011.2159
HG001021.5197
  1. 379
  2. 2
Warning message in fread("/scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt"):
“Detected 3 column names but the data has 4 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.”
A data.table: 6 × 2
IIDBMI
<chr><dbl>
HG0009625.02283
HG0009724.85364
HG0009923.68930
HG0010027.01620
HG0010121.46162
HG0010220.67364
  1. 1092
  2. 2
A data.table: 6 × 3
IIDBMISCORESUM
<chr><dbl><dbl>
HG0009625.022830.6808
HG0009724.853640.5236
HG0009923.689301.8347
HG0010027.016201.3875
HG0010121.461621.2159
HG0010220.673641.5197
  1. 379
  2. 3
Call:
lm(formula = BMI ~ SCORESUM, data = score_merge)

Residuals:
   Min     1Q Median     3Q    Max 
-8.553 -2.047  0.197  1.954  9.607 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24.5928     0.2726  90.230   <2e-16 ***
SCORESUM     -0.3947     0.2194  -1.799   0.0728 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.938 on 377 degrees of freedom
Multiple R-squared:  0.008512,	Adjusted R-squared:  0.005882 
F-statistic: 3.236 on 1 and 377 DF,  p-value: 0.07282
In [28]:
getwd()
'/projects/bga_lab/Rameez/Code/Tutorial'

PRS-cs

In [144]:
sumstat <- sumstat[,c("SNP","Tested_Allele","Other_Allele","BETA","P")]
colnames(sumstat)[1:5] <- c("SNP","A1","A2","BETA","P")
head(sumstat)
fwrite(sumstat,"/scratch/silo1/BGA_LAB/workshop/dataset/sumstat.txt",sep = " ")
A data.table: 6 × 5
SNPA1A2BETAP
<chr><chr><chr><dbl><dbl>
rs10 AC-0.00666.0e-02
rs1000000 AG-0.00019.5e-01
rs10000010TC-0.00221.2e-01
rs10000012CG 0.01434.5e-12
rs10000013AC 0.00153.9e-01
rs10000017TC 0.00192.8e-01
In [ ]:
cat(system(
"
python /home/rasyed2/Tool/PRScs/PRScs.py  \\
--ref_dir=/scratch/silo1/BGA_LAB/dbGaP/GTP/PRScsx/LD_ref/ldblk_1kg_eur \\
--bim_prefix=/scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc  \\
--sst_file=/scratch/silo1/BGA_LAB/workshop/dataset/sumstat.txt \\
--n_gwas=704912  \\
--out_dir=prs_cs
", 
intern=TRUE), sep='\n')
In [152]:
cat(system(
"
cat prs_cs_pst_eff_a1_b0.5_phiauto_chr* > prs_cs_pst_eff_a1_b0.5_phiauto_chrALL.txt
", 
intern=TRUE), sep='\n')

In [158]:
new_beta <- fread("prs_cs_pst_eff_a1_b0.5_phiauto_chrALL.txt")
sumstat_new_beta <- merge(new_beta,sumstat,by.x = "V2", by.y = "SNP", sort = FALSE)
sumstat_new_beta <- sumstat_new_beta[,c(2,1,3,4,5,12,6)]
colnames(sumstat_new_beta)[1:7] <- c("CHR","SNP","POS","A1","A2","old_beta","new_beta")
head(sumstat_new_beta); dim(sumstat_new_beta)
A data.table: 6 × 7
CHRSNPPOSA1A2old_betanew_beta
<int><chr><int><chr><chr><dbl><dbl>
1rs3934834 1005806TC 0.00580.0001813916
1rs3766191 1017587TC 0.00580.0004550472
1rs9442372 1018704AG 0.00470.0001893372
1rs109071771021346GA-0.00550.0004500316
1rs3737728 1021415AG 0.00290.0001370022
1rs9442398 1021695AG 0.00220.0000783243
  1. 675851
  2. 7
In [160]:
cat(system(
"
plink \\
--bed /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bed \\
--bim /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bim \\
--fam /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.fam \\
--score prs_cs_pst_eff_a1_b0.5_phiauto_chrALL.txt 2 4 6 sum include-cnt double-dosage \\
--out score_PRS_cs
", 
intern=TRUE), sep='\n')
PLINK v1.90b6.18 64-bit (16 Jun 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to score_PRS_cs.log.
Options in effect:
  --bed /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bed
  --bim /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.bim
  --fam /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.fam
  --out score_PRS_cs
  --score prs_cs_pst_eff_a1_b0.5_phiauto_chrALL.txt 2 4 6 sum include-cnt double-dosage

1031971 MB RAM detected; reserving 515985 MB for main workspace.
Allocated 163260 MB successfully, after larger attempt(s) failed.
851065 variants loaded from .bim file.
379 people (178 males, 201 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 379 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
851065 variants and 379 people pass filters and QC.
Note: No phenotypes present.
--score: 675851 valid predictors loaded.
--score: Results written to score_PRS_cs.profile .
In [162]:
score <- fread("score_PRS_cs.profile")
score <- score[,c(2,6)]
head(score); dim(score)
pheno <- fread("/scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt")
pheno <- pheno[,c(2,3)]
colnames(pheno) <- c("IID","BMI")
head(pheno)
dim(pheno)
score_merge_prscs <- merge(pheno,score,by = "IID", sort = FALSE)
head(score_merge_prscs); dim(score_merge_prscs)
linear_mod <- lm(BMI ~ SCORESUM, data = score_merge_prscs)
summary(linear_mod)
A data.table: 6 × 2
IIDSCORESUM
<chr><dbl>
HG000960.358007
HG000970.442337
HG000990.748373
HG001000.880771
HG001010.657694
HG001020.584239
  1. 379
  2. 2
Warning message in fread("/scratch/silo1/BGA_LAB/workshop/dataset/BMI_pheno.txt"):
“Detected 3 column names but the data has 4 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.”
A data.table: 6 × 2
IIDBMI
<chr><dbl>
HG0009625.02283
HG0009724.85364
HG0009923.68930
HG0010027.01620
HG0010121.46162
HG0010220.67364
  1. 1092
  2. 2
A data.table: 6 × 3
IIDBMISCORESUM
<chr><dbl><dbl>
HG0009625.022830.358007
HG0009724.853640.442337
HG0009923.689300.748373
HG0010027.016200.880771
HG0010121.461620.657694
HG0010220.673640.584239
  1. 379
  2. 3
Call:
lm(formula = BMI ~ SCORESUM, data = score_merge_prscs)

Residuals:
   Min     1Q Median     3Q    Max 
-8.364 -2.130  0.201  1.946  9.297 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24.3093     0.1620 150.072   <2e-16 ***
SCORESUM     -0.5099     0.2429  -2.099   0.0365 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.933 on 377 degrees of freedom
Multiple R-squared:  0.01155,	Adjusted R-squared:  0.00893 
F-statistic: 4.406 on 1 and 377 DF,  p-value: 0.03648
In [170]:
cor(score_merge_prscs$SCORESUM, score_merge$SCORESUM )
0.759087466079374
In [172]:
(cor(score_merge_prscs$BMI, score_merge_prscs$SCORESUM ))^2
0.01155219299135

Power and accuracy of PRSs: target sample sizes required

In [169]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/power.png")
In [174]:
library(pwr)
pwr.f2.test(u = 1, f2 = 0.01155/(1 - 0.01155), sig.level = 0.05, power = 0.8)
     Multiple regression power calculation 

              u = 1
              v = 671.6323
             f2 = 0.01168496
      sig.level = 0.05
          power = 0.8
In [175]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/power_Cal.PNG")
In [176]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/power_plot.PNG")

Family-Based Whole exome sequence for Mendelian disease and variant annotation

Data in Genomic Analysis:

The basic assumptions in Genome analysis approaches are that:

1) The disease-causing variants most likely disrupts a gene, damaging its protein product
2) The disease-causing variant is rare
3) The disease causal gene is inherited in a Mendelian fashion implying that: Every affected family members should also have the causal variant.

Protein-changing variants

For coding region variants, protein changes are determined by:

  • Synonymous mutations = no change in amino acid in protein chain
  • Non-synonymous mutations = amino acid substitution in protein chain
  • Frame-shift mutation = a shift in the reading frame due to addition or deletion of amino acids.
  • Loss of function variants = disrupt the protein function significantly by truncating or entirely eliminating protein-coding transcripts and include stop-gain, stop-lost, frameshift insertion/deletion, splice site disruptors and are the most deleterious and best candidates for pathogenic variants.
  • Gain of function variants = proteins gains a new molecular function.
  • Missense variants = can also be pathogenic, deleterious can be predicted from algorithms like PhyloP, SIFT, and MutationTaster, CADD
  • Point mutation = single base is changed
In [27]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/point_mutation.png")

Occurrence of rare variants in the population:

Several public databases are the source of information about population allele frequencies:

These include the ExAC, EVS, and 1K Genome and gnomAD database. These databases aggregate genome/exome sequence data across thousands of individuals from different studies and represent a healthy cohort in which rare variants are present at extremely low frequency or not at all.

In [20]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/mode_of_inherit.png")
In [19]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/recessive.png")
In [26]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/exome_analysis_V2.png")

Understanding the VCF file

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variations found in sequencing your samples.

  • meta-data: This segment contain information used to explain the variant-data contents
    • ##fileformat=VCFv4.0: the current VCF file version. Please refer to the Global Alliance for Genomics and Health Data Working group file format team for more information.
    • ##INFO= … : These rows explain the contents of the INFO column
    • ##FILTER= … : These rows explain the contents of the FILTER column
    • ##FORMAT= … : These rows explain the contents of the FORMAT column
  • variant-data: These are the actual variant data; begins with a header row:
    • #CHROM: chromosome number
    • POS: starting position
    • ID: reference information if available, such as the dbSNP info
    • REF: the base in the reference genome
    • ALT: the alternative base found in this sample
    • QUAL: The Phred-scaled probability
    • FILTER: This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters. If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis.
    • INFO: Various site-level annotations
    • FORAMT: how the genotype and other sample-level information is represented. (more on this later)
In [2]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/vcf_format.png")

Genotype Representation Explained

In [4]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/vcf_genotype.png")

GT: The genotype of this sample at this site. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there’s a single ALT allele (by far the more common case), GT will be either:

  • 0/0 - the sample is homozygous reference
  • 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
  • 1/1 - the sample is homozygous alternate

Understanding our disease models

In [24]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/disease_mod.png")

Sample Description

In [23]:
IRdisplay::display_png(file = "/scratch/silo1/BGA_LAB/workshop/figures/Family_Lineage.png")

Step 1: Extract columns from VCF files

In [ ]:
cat(system(
"
bcftools query -f '%CHROM \t%POS \t%ID \t%REF \t%ALT \t%QUAL \t%DP \t%INFO/Func.refGene \t%INFO/Gene.refGene \t%INFO/ExonicFunc.refGene \t%INFO/ExAC_ALL \t%INFO/ExAC_SAS \t%INFO/avsnp147 \t%INFO/SIFT_score \t%INFO/SIFT_pred \t%INFO/Polyphen2_HDIV_score \t%INFO/Polyphen2_HDIV_pred \t%INFO/Polyphen2_HVAR_score \t%INFO/Polyphen2_HVAR_pred \t%INFO/MutationTaster_score \t%INFO/MutationTaster_pred \t%INFO/MutationAssessor_score \t%INFO/MutationAssessor_pred \t%INFO/CADD_raw \t%INFO/CADD_phred \t%INFO/DANN_score [\t%GT] [\t%DP]\n' \\
Fam_merge_file_V1.vcf > fam_dat.txt
", 
intern=TRUE), sep='\n')

Import VCF file

In [11]:
options(repr.matrix.max.cols=150, repr.matrix.max.rows=200)

dat <- fread("/scratch/silo1/rameez/Family/IDP_Families/Fam_1/tmp.txt")

colnames(dat)[1:40] <- c('CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'DP' ,'Func.refGene' ,
'Gene.refGene' ,'ExonicFunc.refGene', 'ExAC_ALL',
'ExAC_SAS', 'avsnp147', 'SIFT_score' ,'SIFT_pred' ,
'Polyphen2_HDIV_score', 'Polyphen2_HDIV_pred' ,
'Polyphen2_HVAR_score', 'Polyphen2_HVAR_pred' ,
'MutationTaster_score' ,'MutationTaster_pred' ,
'MutationAssessor_score', 'MutationAssessor_pred' ,
'CADD_raw', 'CADD_phred' ,'DANN_score','F1_1_GT', 'F1_3_GT','F1_5_GT','F1_9_GT','F1_2_GT','F1_7_GT','F1_8_GT',
'F1_1_DP', 'F1_3_DP','F1_5_DP','F1_9_DP','F1_2_DP','F1_7_DP','F1_8_DP')



head(dat); dim(dat)
A data.table: 6 × 40
CHROMPOSIDREFALTQUALDPFunc.refGeneGene.refGeneExonicFunc.refGeneExAC_ALLExAC_SASavsnp147SIFT_scoreSIFT_predPolyphen2_HDIV_scorePolyphen2_HDIV_predPolyphen2_HVAR_scorePolyphen2_HVAR_predMutationTaster_scoreMutationTaster_predMutationAssessor_scoreMutationAssessor_predCADD_rawCADD_phredDANN_scoreF1_1_GTF1_3_GTF1_5_GTF1_9_GTF1_2_GTF1_7_GTF1_8_GTF1_1_DPF1_3_DPF1_5_DPF1_9_DPF1_2_DPF1_7_DPF1_8_DP
<chr><int><chr><chr><chr><dbl><int><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr>
chr1871334.GT2206.20001377intronicSAMD11. . . rs4072383 . .. .. .. .. .. . . 1/10/10/11/10/10/10/1309136193214155210160
chr1878314.GC 37.2171 55exonic SAMD11synonymous_SNV 0.1068 0.0624rs142558220. .. .. .. .. .. . . 0/1./../../../../../.55 . . . . . .
chr1878619.GA 207.0430 796intronicSAMD11. 5.4e-050.0002rs770459310. .. .. .. .. .. . . 0/10/1./.0/1./.0/10/1238144. 145. 19277
chr1880238.AG1878.7100 869intronicNOC2L . . . rs3748592 . .. .. .. .. .. . . 1/11/11/11/11/11/11/117992 78 12361 25383
chr1881627.GA2706.28001725exonic NOC2L synonymous_SNV 0.5653 0.5648rs2272757 . .. .. .. .. .. . . 1/10/10/11/10/10/10/1388156228307214227205
chr1883549.TC 104.5500 236exonic NOC2L nonsynonymous_SNV0.0014 0.0035rs1496252800.25T0.22B0.165B1.000D1.675L0.7008.8290.9390/1./.0/10/10/1./../.105. 43 43 45 . .
  1. 50316
  2. 40

Step 2: Find the Disease Model Variant

We will demonstrate the disease model Autosomal Recessive Homogyzous

In [12]:
dat_homo <- dat[ (dat$F1_1_GT == "0/1" & dat$F1_2_GT == "0/1") 
                & (dat$F1_3_GT == "1/1") & (dat$F1_9_GT == "1/1") & (dat$F1_8_GT == "1/1"),]
head(dat_homo); dim(dat_homo)
A data.table: 6 × 40
CHROMPOSIDREFALTQUALDPFunc.refGeneGene.refGeneExonicFunc.refGeneExAC_ALLExAC_SASavsnp147SIFT_scoreSIFT_predPolyphen2_HDIV_scorePolyphen2_HDIV_predPolyphen2_HVAR_scorePolyphen2_HVAR_predMutationTaster_scoreMutationTaster_predMutationAssessor_scoreMutationAssessor_predCADD_rawCADD_phredDANN_scoreF1_1_GTF1_3_GTF1_5_GTF1_9_GTF1_2_GTF1_7_GTF1_8_GTF1_1_DPF1_3_DPF1_5_DPF1_9_DPF1_2_DPF1_7_DPF1_8_DP
<chr><int><chr><chr><chr><dbl><int><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr>
chr4185940461.CT1475.4201472intronicHELT . . . rs4862491 .............0/11/10/11/10/10/11/1377198205194194136168
chr4186097045.CT1103.2001440exonic CFAP97synonymous_SNV0.48460.4629rs6855305 .............0/11/10/11/10/10/11/1313146205156244224152
chr4186423637.GA1199.8901196exonic PDLIM3synonymous_SNV0.783 0.7463rs4635850 .............0/11/10/11/10/10/11/121716116314913828979
chr4186427841.TC1991.4101930intronicPDLIM3. 0.81370.7507rs2306705 .............0/11/10/11/10/10/11/1483208217273192337220
chr7112579863.GA 856.6631381UTR5 BMT2 . . . rs34728190.............0/11/11/11/10/10/11/1355126159122166336117
chr7121719586.CT 700.830 644intronicAASS . . . rs1206368 .............0/11/11/11/10/10/11/115871 86 67 10110061
  1. 165
  2. 40

Step 3: Find “Coding” Location

In [16]:
dat_homo_cod <- dat_homo[dat_homo$Func.refGene == "exonic" & dat_homo$ExonicFunc.refGene != "synonymous_SNV"  ,]
(dat_homo_cod); dim(dat_homo_cod)
A data.table: 30 × 40
CHROMPOSIDREFALTQUALDPFunc.refGeneGene.refGeneExonicFunc.refGeneExAC_ALLExAC_SASavsnp147SIFT_scoreSIFT_predPolyphen2_HDIV_scorePolyphen2_HDIV_predPolyphen2_HVAR_scorePolyphen2_HVAR_predMutationTaster_scoreMutationTaster_predMutationAssessor_scoreMutationAssessor_predCADD_rawCADD_phredDANN_scoreF1_1_GTF1_3_GTF1_5_GTF1_9_GTF1_2_GTF1_7_GTF1_8_GTF1_1_DPF1_3_DPF1_5_DPF1_9_DPF1_2_DPF1_7_DPF1_8_DP
<chr><int><chr><chr><chr><dbl><int><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr>
chr7 123514896.GT1119.2001030exonicHYAL4 nonsynonymous_SNV0.15080.1821rs6949082 1.0 T0.0 B0.0 B1 P-2.075N0.922 10.210.4120/11/11/11/10/10/11/1230150147104143142114
chr7 128843396.GA1098.7301047exonicSMO nonsynonymous_SNV0.00950.0532rs61746143 0.007D0.768P0.372B0.954D1.8 L6.748 32 1.0000/11/11/11/10/10/11/1238148133151101151125
chr7 131195712.GA1185.7201386exonicPODXL nonsynonymous_SNV0.42820.3554rs12670788 0.387T0.394B0.042B1 P0 N-1.1590.0080.5060/11/11/11/10/10/11/1291154141161124400115
chr7 134260667.CT1163.4301521exonicAKR1B15 nonsynonymous_SNV0.00030.0007rs5730408040.009D0.994D0.528P1 D2.1 M7.022 33 0.9970/11/11/11/10/10/11/1354176209149211238184
chr14 50101682.CG1373.6301345exonicDNAAF2 nonsynonymous_SNV0.66640.5603rs2985684 0.087T0.946P0.606P0.000P2.375 M3.838 23.4 0.9940/11/11/11/10/10/11/1308157137169141248185
chr14102792386.TC1158.9901159exonicZNF839 nonsynonymous_SNV0.28340.2919rs7158731 1.0 T0.0 B0.0 B1 P-1.39 N0.396 6.5770.6760/11/11/11/10/10/11/1239151108143125244149
chr14102792631.GA 728.340 623exonicZNF839 nonsynonymous_SNV0.28020.2873rs7158139 1.0 T0.0 B0.0 B1 P-0.755N-2.3840.0010.2670/11/11/11/10/10/11/112786 65 51 58 13898
chr14102901204.CG1651.7201417exonicTECPR2 nonsynonymous_SNV0.04240.0525rs45467297 0.208T0.006B0.004B1 N0.895 L-0.7210.0640.8160/11/11/11/10/10/11/1243171171222189242179
chr14105393556.GC1177.050 955exonicPLD4 nonsynonymous_SNV0.45070.4029rs2841280 0.534T0.15 B0.017B1 P0.14 N-0.5470.1690.2860/11/11/11/10/10/11/117299 115163113187106
chr15 63970456.TC 515.997 385exonicHERC1 nonsynonymous_SNV0.44930.3543rs2228510 0.009D0.001B0.001B0.009P1.1 L1.077 11.090.9790/11/1./.1/10/1./.1/114745 . 72 72 . 49
chr15 65369395.CT2264.1001829exonicKBTBD13 nonsynonymous_SNV0.40030.3759rs2919358 0.302T0.889P0.653P0.001P0.655 N6.111 28.3 0.9980/11/1./.1/10/1./.1/1680304. 297243. 305
chr15 65494212.AG1159.090 808exonicCILP nonsynonymous_SNV0.56820.4475rs2073711 0.259T0.0 B0.001B0.006P2.085 M1.790 14.930.8340/11/1./.1/10/1./.1/1288121. 156138. 105
chr15 66641732.GC 753.645 447exonicTIPIN nonsynonymous_SNV0.11180.0963rs2063690 0.324T0.604P0.286B0.000P1.125 L4.531 24.3 0.9880/11/1./.1/10/1./.1/1137101. 39 87 . 83
chr17 2883588.CA1310.8101251exonicRAP1GAP2nonsynonymous_SNV0.33830.4835rs17762452 0.188T0.173B0.174B0.999P0 N2.759 21.1 0.9790/11/11/11/10/10/11/1290146170176112186171
chr17 4461748.GA 555.029 816exonicGGT6 stopgain 0.44230.5673rs7215121 . .. .. .1.000P. .-0.3290.5700.8180/11/10/11/10/10/11/118163 74 11781 198102
chr17 5058808.GA1191.9601082exonicUSP6 nonsynonymous_SNV0.50870.274 rs9899177 0.243T0.999D0.876P0.309P1.75 L5.278 25.7 0.9990/11/10/11/10/10/11/1249160131123144135140
chr17 5087040.AG 645.407 560exonicZNF594 nonsynonymous_SNV0.36250.1665rs9908414 0.893T0.0 B0.001B1 P1.32 L-2.7370.0010.1030/11/10/11/10/10/11/190 87 67 69 76 10665
chr17 6683684.AG 584.615 534exonicFBXO39 nonsynonymous_SNV0.14660.1207rs16956264 0.161T0.0 B0.001B1 P0.345 N-0.2011.0930.8490/11/10/11/10/10/11/112980 53 46 52 10668
chr17 6690164.AG1824.5401827exonicFBXO39 nonsynonymous_SNV0.18250.139 rs7213731 0.219T0.001B0.004B1.000P0 N0.447 7.0100.7320/11/10/11/10/10/11/1398210209247258280225
chr17 7123443.CT2052.7002149exonicACADVL nonsynonymous_SNV0.00030.0008rs7275037880.345T0.009B0.007B1 N1.245 L1.773 14.830.9900/11/10/11/10/10/11/1430243198280201552245
chr17 7155861.GA2811.5502566exonicELP5 nonsynonymous_SNV0.13220.2184rs2521988 0.258T0.0 B0.0 B1 P0.55 N-1.9040.0020.7250/11/10/11/10/10/11/1517379284367240457322
chr17 7346430.CT1409.3701264exonicFGF11 nonsynonymous_SNV0.00020.0012rs5326517290.773T0.243B0.054B1 D0 N1.964 15.980.9420/11/10/11/10/10/11/128618112319712126195
chr17 7385454.GA1064.6001110exonicSLC35G6 nonsynonymous_SNV0.18280.2234rs3760422 0.187T0.999D0.997D0.219P0.695 N2.258 17.890.9980/11/10/11/10/10/11/1277134148143101199108
chr17 7484101.CA1281.6601094exonicCD68 nonsynonymous_SNV0.15450.1503rs9901673 0.053T0.012B0.023B0.879P2.2 M3.173 22.7 0.7230/11/10/11/10/10/11/1202143106172118241112
chr17 7484812.GA 671.824 756exonicCD68 nonsynonymous_SNV0.05780.1034rs9901675 0.755T0.001B0.004B1 P-1.25 N0.303 5.7230.7420/11/10/11/10/10/11/121396 90 79 74 13470
chr17 7490810.GA 796.189 696exonicMPDU1 nonsynonymous_SNV0.154 0.1512rs10852891 0.213T0.602P0.043B0.335P1.975 M2.709 20.9 0.9950/11/10/11/10/10/11/115210960 93 56 17254
chr17 7529902.GA1746.5501542exonicSAT2 nonsynonymous_SNV0.07250.1077rs13894 0.003D1.0 D0.984D0.001P1.57 L7.413 34 0.9990/11/10/11/10/10/11/1288187167247182288183
chr17 7735934.GA1164.4701240exonicDNAH2 nonsynonymous_SNV0.10780.1711rs61745181 0.109T0.955P0.658P0.001P1.645 L6.316 29.2 0.9990/11/10/11/10/10/11/1286166180136130198144
chr17 7761512.CA 625.934 787exonicCYB5D1 nonsynonymous_SNV0.17410.227 rs12453250 0.012D0.964D0.665P0.983P1.205 L4.385 24.1 0.9930/11/10/11/10/10/11/119785 11674 10614663
chr17 8701799.GT1475.6201874exonicMFSD6L nonsynonymous_SNV0.219 0.2621rs17854013 0.34 T0.922P0.598P1.000P1.39 L0.700 8.8300.7790/11/10/11/10/10/11/1434203212201215389220
  1. 30
  2. 40

Step4: variant prioritization

In [17]:
dat_homo_cod_prio <- dat_homo_cod[dat_homo_cod$ExAC_ALL < 0.01 & dat_homo_cod$CADD_phred > 10  ,]
(dat_homo_cod_prio); dim(dat_homo_cod_prio)
A data.table: 4 × 40
CHROMPOSIDREFALTQUALDPFunc.refGeneGene.refGeneExonicFunc.refGeneExAC_ALLExAC_SASavsnp147SIFT_scoreSIFT_predPolyphen2_HDIV_scorePolyphen2_HDIV_predPolyphen2_HVAR_scorePolyphen2_HVAR_predMutationTaster_scoreMutationTaster_predMutationAssessor_scoreMutationAssessor_predCADD_rawCADD_phredDANN_scoreF1_1_GTF1_3_GTF1_5_GTF1_9_GTF1_2_GTF1_7_GTF1_8_GTF1_1_DPF1_3_DPF1_5_DPF1_9_DPF1_2_DPF1_7_DPF1_8_DP
<chr><int><chr><chr><chr><dbl><int><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr>
chr7 128843396.GA1098.731047exonicSMO nonsynonymous_SNV0.00950.0532rs61746143 0.007D0.768P0.372B0.954D1.8 L6.74832 1.0000/11/11/11/10/10/11/1238148133151101151125
chr7 134260667.CT1163.431521exonicAKR1B15nonsynonymous_SNV0.00030.0007rs5730408040.009D0.994D0.528P1 D2.1 M7.02233 0.9970/11/11/11/10/10/11/1354176209149211238184
chr17 7123443.CT2052.702149exonicACADVL nonsynonymous_SNV0.00030.0008rs7275037880.345T0.009B0.007B1 N1.245L1.77314.830.9900/11/10/11/10/10/11/1430243198280201552245
chr17 7346430.CT1409.371264exonicFGF11 nonsynonymous_SNV0.00020.0012rs5326517290.773T0.243B0.054B1 D0 N1.96415.980.9420/11/10/11/10/10/11/128618112319712126195
  1. 4
  2. 40
In [ ]:

In [ ]:

Appendix

For this workshop, we use publicly available molecular genetic data from the 1000 Genome Project and simulated phenotypes based on publicly available genome-wide association study (GWAS) results for BMI. First, we downloaded the data, extract HapMap3 SNP, ran some basic QC

In [ ]:
setwd("/scratch/silo1/rameez/Tutorial")
In [ ]:
getwd()
In [ ]:
# Download The dataset
cat(system(
"wget https://www.dropbox.com/s/k9ptc4kep9hmvz5/1kg_phase1_all.tar.gz", 
intern=TRUE), sep='\n')
In [ ]:
# unzip File 
cat(system(
"tar -xzf 1kg_phase1_all.tar.gz", 
intern=TRUE), sep='\n')
In [ ]:
# Extract hm3 snps 

cat(system(
"plink --bfile 1kg_phase1_all \\
--extract hm.snplist --make-bed \\
--out 1kg_hm3", 
intern=TRUE), sep='\n')
In [81]:
eur_1000G <- fread("/projects/bga_lab/REPOSITORIES/PRS/ALLchr_1000G_EUR503_biallelic_rs_cleaned.bim")
head(eur_1000G);dim(eur_1000G)
eur_1000G_hapmap <- eur_1000G[eur_1000G$V2 %in% hapmap$rsid, ]
head(eur_1000G_hapmap); dim(eur_1000G_hapmap)
fwrite(as.data.frame(eur_1000G_hapmap$V2), "hap_snps.txt", sep = " ", quote = FALSE, col.names = FALSE)

cat(system(
"plink --bfile /projects/bga_lab/REPOSITORIES/PRS/ALLchr_1000G_EUR503_biallelic_rs_cleaned \\
--extract hap_snps.txt \\
--make-bed --out /scratch/silo1/BGA_LAB/workshop/dataset/EUR503_hapmap  \\
", 
intern=TRUE), sep='\n')
A data.table: 6 × 6
V1V2V3V4V5V6
<int><chr><int><int><chr><chr>
1rs201725126013116GT
1rs200579949013118GA
1rs199856693014933AG
1rs200482301015820TG
1rs199745162016949CA
1rs140337953030923GT
  1. 7486182
  2. 6
A data.table: 6 × 6
V1V2V3V4V5V6
<int><chr><int><int><chr><chr>
1rs3131972 0752721AG
1rs3131969 0754182AG
1rs1048488 0760912CT
1rs125620340768448AG
1rs4040617 0779322GA
1rs4970383 0838555AC
  1. 1052839
  2. 6
PLINK v1.90b6.18 64-bit (16 Jun 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /scratch/silo1/BGA_LAB/workshop/dataset/EUR503_hapmap.log.
Options in effect:
  --bfile /projects/bga_lab/REPOSITORIES/PRS/ALLchr_1000G_EUR503_biallelic_rs_cleaned
  --extract hap_snps.txt
  --make-bed
  --out /scratch/silo1/BGA_LAB/workshop/dataset/EUR503_hapmap

1031971 MB RAM detected; reserving 515985 MB for main workspace.
Allocated 163260 MB successfully, after larger attempt(s) failed.
7486182 variants loaded from .bim file.
503 people (0 males, 0 females, 503 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/scratch/silo1/BGA_LAB/workshop/dataset/EUR503_hapmap.nosex .
--extract: 1052839 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 503 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
1052839 variants and 503 people pass filters and QC.
Note: No phenotypes present.
--make-bed to /scratch/silo1/BGA_LAB/workshop/dataset/EUR503_hapmap.bed +
/scratch/silo1/BGA_LAB/workshop/dataset/EUR503_hapmap.bim +
/scratch/silo1/BGA_LAB/workshop/dataset/EUR503_hapmap.fam ... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.
In [82]:
#sumstat <- fread("/scratch/silo1/BGA_LAB/workshop/dataset/Giant_Height2018.txt.gz")
#head(sumstat); dim(sumstat)
#fwrite(sumstat,"/scratch/silo1/BGA_LAB/workshop/dataset/height.txt", sep = " ")
A data.table: 6 × 10
CHRPOSSNPTested_AlleleOther_AlleleFreq_Tested_Allele_in_HRSBETASEPN
<int><int><chr><chr><chr><dbl><dbl><dbl><dbl><dbl>
7 92383888rs10 AC0.06431-0.00660.00356.0e-02605309
12126890980rs1000000 AG0.22190-0.00010.00179.5e-01706961
4 21618674rs10000010TC0.50860-0.00220.00141.2e-01697417
4 1357325rs10000012CG0.86340 0.01430.00214.5e-12709514
4 37225069rs10000013AC0.77080 0.00150.00173.9e-01704912
4 84778125rs10000017TC0.22840 0.00190.00172.8e-01703109
  1. 2334001
  2. 10

Temp

In [ ]:
cat(system('zcat /scratch/silo1/cbenca/dbGaP/2022/Extract_MVP_Files/phs001672.MVP.analysis-PI.c1.HMB-MDS.set5-release/Submissions/sub20210303/dbGAP_reexperiencing_afr.gz | head', intern=TRUE), sep='\n')
In [ ]:
cat(system("plink --help", intern=TRUE), sep='\n')
In [ ]:
cat(system("/projects/bga_lab/Rameez/Tool/miniconda3/envs/ldsc/bin/python /projects/bga_lab/Rameez/Tool/ldsc/ldsc.py", intern=TRUE), sep='\n')
In [ ]:
cat(system("/projects/bga_lab/Rameez/Tool/miniconda3/envs/ldsc/bin/python /projects/bga_lab/Rameez/Tool/ldsc/ldsc.py --h2 /scratch/silo1/BGA_LAB/dbGaP/GTP/PRS_V2/LD_ref/ldsc/AUD_afr.sumstats.gz --ref-ld-chr /scratch/silo1/BGA_LAB/dbGaP/GTP/LD_reference/ld_estimate/ --w-ld-chr /scratch/silo1/BGA_LAB/dbGaP/GTP/LD_reference/ld_estimate/  --out /scratch/silo1/BGA_LAB/dbGaP/GTP/PRS_V2/LD_ref/ldsc/test  "
           , intern=TRUE), sep='\n')
In [ ]:
#cat(system(
#"awk -F ' ' '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14}' /scratch/silo1/BGA_LAB/workshop/dataset/1kg_EU_qc.ped | head -n5", 
#intern=TRUE), sep='\n')
In [ ]: